The rule in a particular game is that a character's power is proportional to the triangular root of the character's experience. For example, 15-20 experience gives 5 power, 21-27 experience gives 6 power, 28-35 experience gives 7 power, etc. Some players are known to have achieved experience in the hundreds of billions.
I am trying to implement this game on an 8-bit machine that has only three arithmetic instructions: add, subtract, and divide by 2. For example, to multiply a number by 4, a program would add it to itself twice. General multiplication is much slower; I've written a software subroutine to do it using a quarter-square table.
I had considered calculating the triangular root via bisection: search for the successive triangular numbers bounding an experience number from above and below. My plan was to apply a recurrence identity for T(2*p) until it exceeds the experience, then use that as the upper bound for a bisection search. But I'm having trouble finding an identity for T((x+y)/2) for the bisection step that doesn't use either x*y or (x+y)^2.
Is there an efficient algorithm to calculate the triangular root of a number with just add, subtract, and halve? Or will I end up having to perform O(log n) multiplications, one to calculate each midpoint in the bisection search? Or would it be better to consider implementing long division to use Newton's method?
Definition of T(x):
T(x) = (x * (x + 1))/2
Identities that I derived:
T(2*x) = 4*T(x) - x
# e.g. T(5) = 15, T(10) = 4*15 - 5 = 55
T(x/2) = (T(x) + x/2)/4
# e.g. T(10) = 55, T(5) = (55 + 5)/4 = 15
T(x + y) = T(x) + T(y) + x*y
# e.g. T(3) = 6, T(7) = 28, T(10) = 6 + 28 + 21 = 55
T((x + y)/2) = (T(x) + T(y) + x*y + (x + y)/2)/4
# e.g. T(3) = 6, T(7) = 28, T(5) = (6 + 28 + 21 + 10/2)/4 = 15
Do bisection search, but make sure that y - x is always a power of two. (This does not increase the asymptotic running time.) Then T((x + y)/2) = T(x + h) = T(x) + T(h) + x*h, where h = (y - x)/2 is a power of two, so x*h is computable with a shift.
Here's a Python proof of concept: hastily written and more or less unoptimized, but it avoids expensive operations.
def tri(n):
    return (n * (n + 1)) >> 1

def triroot(t):
    y = 1
    ty = 1
    # Find a starting point for bisection search by doubling y using
    # the identity T(2*y) = 4*T(y) - y. Stop when T(y) exceeds t.
    # At the end, y = 2*x, tx = T(x), and ty = T(y).
    while ty <= t:
        assert ty == tri(y)
        tx = ty
        ty += ty
        ty += ty
        ty -= y
        x = y
        y += y
    # Now do bisection search on the interval [x .. x + h),
    # using these identities:
    #   T(x + h) = T(x) + T(h) + x*h
    #   T(h/2) = (T(h) + h/2)/4
    th = tx
    h = x
    x_times_h = (tx + tx) - x
    while True:
        assert tx == tri(x)
        assert x_times_h == x * h
        # Divide h by 2
        h >>= 1
        x_times_h >>= 1
        if not h:
            break
        th += h
        th >>= 1
        th >>= 1
        # Calculate the midpoint of the search interval
        tz = (tx + th) + x_times_h
        z = x + h
        assert tz == tri(z)
        # If the midpoint is below the target, move the lower bound
        # of the search interval up to the midpoint
        if t >= tz:
            tx = tz
            x = z
            x_times_h += (th + th) - h
    return x

for q in range(1, 100):
    p = triroot(q)
    assert tri(p) <= q < tri(p + 1)
    print(q, p)
As observed on the linked math.stackexchange.com page, there is a direct formula for this problem: since x = n*(n+1)/2, the inverse is
n = (sqrt(1 + 8*x) - 1)/2
There is the square root to deal with, but I would suggest using this direct formula with an implementation like the following:
tmp = x + x;      '2*x
tmp += tmp;       '4*x
tmp += tmp + 1;   '8*x + 1
n = 0;
n2 = 0;
while (n2 <= tmp) {
    n2 += n + n + 1;  'remember that (n+1)^2 - n^2 = 2*n + 1
    n++;
}
'here after the loop n = floor(sqrt(8*x+1)) + 1
n -= 2;  'floor(sqrt(8*x+1)) - 1
n /= 2;  '(floor(sqrt(8*x+1)) - 1) / 2
Of course this can be improved for better performance if needed, for example by noting that whenever sqrt(8*x+1) is an integer it is odd (because 8*T(n) + 1 = (2*n+1)^2), so n can be advanced in steps of 2, rewriting the n2 calculation accordingly: n2 += n + n + n + n + 4 (which can itself be written better than this).
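For reference, here is a minimal Python sketch of that step-of-2 variant (my own illustration, not the original code): it scans odd squares only, since 8*T(n) + 1 = (2*n + 1)^2, and apart from the final halving it needs nothing but additions (the assert at the bottom is host-side verification only).

def triroot(x):
    # Find the largest odd m with m*m <= 8*x + 1; the triangular root is (m-1)/2.
    tmp = x + x           # 2*x
    tmp += tmp            # 4*x
    tmp += tmp + 1        # 8*x + 1
    m, m2 = 1, 1          # m is odd, m2 == m*m
    step = 8              # (m+2)^2 - m^2 = 4*m + 4, which is 8 at m = 1
    while m2 + step <= tmp:
        m2 += step
        step += 8         # as m grows by 2, 4*m + 4 grows by 8
        m += 2
    return (m - 1) >> 1

assert all(triroot(t) == n for n in range(1, 50)
           for t in range(n * (n + 1) // 2, (n + 1) * (n + 2) // 2))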
Related
The task is to find the sum of the series given n and a. So for the series 1a + 2a^2 + 3a^3 + ... + na^n, we can find the n-th element with the following formula (from observation):
n-th element = a^n * ((n-(n-2))/(n-(n-1))) * ((n-(n-3))/(n-(n-2))) * ... * (n/(n-1))
I think that it's impossible to simplify the sum of n elements by turning the above formula into a closed sum formula. Even if it were possible, I assume it would involve the exponent n, which introduces an n-step loop and keeps the solution from being O(log n). The best I can do is find the ratio between consecutive elements, which is a*(n+1)/n, and apply it to the (n-1)-th element to find the n-th element.
I think that I may be missing something. Could someone provide me with solution(s)?
You can solve this problem, and lots of problems like it, with matrix exponentiation.
Let's start with this sequence:
A[n] = a + a^2 + a^3 ... + a^n
That sequence can be generated with a simple formula:
A[i] = a*(A[i-1] + 1)
Now if we consider your sequence:
B[n] = a + 2a^2 + 3a^3 ... + na^n
We can generate that with a formula that makes use of the previous one:
B[i] = (B[i-1] + A[i-1] + 1) * a
If we make a sequence of vectors containing all the components we need:
V[n] = (B[n], A[n], 1)
Then we can construct a matrix M so that:
V[i] = M*V[i-1]
And so:
V[n] = (M^(n-1))V[1]
Since the size of the matrix is fixed at 3x3, you can use exponentiation by squaring on the matrix itself to calculate M^(n-1) in O(log n) time, and the final multiplication takes constant time.
Here's an implementation in Python with numpy (so I don't have to include matrix multiply code):

import numpy as np

def getSum(a, n):
    # A[n] = a + a^2 + a^3 + ... + a^n
    # B[n] = a + 2a^2 + 3a^3 + ... + na^n
    # V[n] = [B[n], A[n], 1]
    M = np.matrix([
        [a, a, a],  # B[i] = B[i-1]*a + A[i-1]*a + a
        [0, a, a],  # A[i] = A[i-1]*a + a
        [0, 0, 1]
    ])
    # calculate MsupN = M^(n-1) by exponentiation by squaring
    n -= 1
    MsupN = np.matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
    while n > 0:
        if n % 2 > 0:
            MsupN *= M
            n -= 1
        M *= M
        n //= 2  # integer division (n/2 would produce a float in Python 3)
    # calculate V[n] = MsupN * V[1]
    Vn = MsupN * np.matrix([a, a, 1]).T
    return Vn.item(0, 0)
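For example, getSum(3, 5) should return 1641 (3 + 2*9 + 3*27 + 4*81 + 5*243), which you can verify by hand.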
I assume a, n are nonnegative integers. The explicit formula for a > 1 is
a * (n * a^{n + 1} - (n + 1) * a^n + 1) / (a - 1)^2
It can be evaluated efficiently in O(log(n)) using square-and-multiply for a^n.
To derive the formula, you could use the following ingredients:
the explicit formula for the geometric series,
the observation that this polynomial looks almost like the derivative of a geometric series,
and the Gaussian sum formula for the special case a = 1.
Now you can simply calculate:
sum_{i = 1}^n i * a^i // [0] ugly sum
= a * sum_{i = 1}^n i * a^{i-1} // [1] linearity
= a * d/da (sum_{i = 1}^n a^i) // [2] antiderivative
= a * d/da (sum_{i = 0}^n a^i - 1) // [3] + 1 - 1
= a * d/da ((a^{n + 1} - 1) / (a - 1) - 1) // [4] geom. series
= a * ((n + 1)*a^n / (a - 1) - (a^{n+1} - 1)/(a - 1)^2) // [5] derivative
= a * (n * a^{n + 1} - (n + 1)a^n + 1) / (a - 1)^2 // [6] explicit formula
This is just a simple arithmetic expression with a^n, which can be evaluated in O(log(n)) time using square-and-multiply.
This doesn't work for a = 0 or a = 1, so you have to treat those cases specially: for a = 0 you just return 0 immediately, for a = 1, you return n * (n + 1) / 2.
Scala snippet to test the formula:
def fast(a: Int, n: Int): Int = {
  def pow(a: Int, n: Int): Int =
    if (n == 0) 1
    else if (n == 1) a
    else {
      val r = pow(a, n / 2)
      if (n % 2 == 0) r * r else r * r * a
    }

  if (a == 0) 0
  else if (a == 1) n * (n + 1) / 2
  else {
    val aPowN = pow(a, n)
    val d = a - 1
    a * (n * aPowN * a - (n + 1) * aPowN + 1) / (d * d)
  }
}
Slower, but simpler version, for comparison:
def slow(a: Int, n: Int): Int = {
  def slowPow(a: Int, n: Int): Int = if (n == 0) 1 else slowPow(a, n - 1) * a
  (1 to n).map(i => i * slowPow(a, i)).sum
}
Comparison:
for (a <- 0 to 5; n <- 0 to 5) {
  println(s"${slow(a, n)} <-> ${fast(a, n)}")
}
Output:
0 <-> 0
0 <-> 0
0 <-> 0
0 <-> 0
0 <-> 0
0 <-> 0
0 <-> 0
1 <-> 1
3 <-> 3
6 <-> 6
10 <-> 10
15 <-> 15
0 <-> 0
2 <-> 2
10 <-> 10
34 <-> 34
98 <-> 98
258 <-> 258
0 <-> 0
3 <-> 3
21 <-> 21
102 <-> 102
426 <-> 426
1641 <-> 1641
0 <-> 0
4 <-> 4
36 <-> 36
228 <-> 228
1252 <-> 1252
6372 <-> 6372
0 <-> 0
5 <-> 5
55 <-> 55
430 <-> 430
2930 <-> 2930
18555 <-> 18555
So, yes, the O(log(n)) formula gives the same numbers as the O(n^2) formula.
a^n can indeed be computed in O(log n).
The method is called Exponentiation by squaring and the main idea is that if you know a^n you also know a^(2*n) which is just a^n * a^n.
So if you want to compute a^n (if n is even) you can just compute a^(n/2) and multiply the result with itself: a^n = a^(n/2) * a^(n/2). So instead of having a loop up to n, now you only have a loop up to n/2. But n/2 is just another number, and can be computed the same way, thus doing only half the operations. Halving the number of operations each time leads to the logarithmic complexity.
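As a minimal sketch of the idea (in Python for brevity; pow_by_squaring is just an illustrative name):

def pow_by_squaring(a, n):
    # Invariant: result * a**n equals the original power throughout.
    result = 1
    while n > 0:
        if n % 2 == 1:   # odd exponent: peel off one factor of a
            result *= a
            n -= 1
        a *= a           # square the base...
        n //= 2          # ...and halve the exponent
    return result

assert pow_by_squaring(3, 13) == 3 ** 13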
As mentioned by @Sopel in the comments, the series can be written as a simple closed-form function:

          a * (n * a^(n+1) - (n+1) * a^n + 1)
f(a,n) = --------------------------------------
                      (a - 1)^2

So to find the answer you only have to compute the above formula, using the fast exponentiation described above, giving O(log N) complexity.
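For instance, a small Python sketch putting the formula and the fast exponentiation together (Python's built-in pow already uses exponentiation by squaring; a = 0 and a = 1 are the special cases noted in the answer above):

def f(a, n):
    # a * (n*a^(n+1) - (n+1)*a^n + 1) / (a-1)^2, with exact integer division
    if a == 0:
        return 0
    if a == 1:
        return n * (n + 1) // 2
    a_n = pow(a, n)  # O(log n) multiplications
    return a * (n * a_n * a - (n + 1) * a_n + 1) // ((a - 1) ** 2)

assert f(3, 5) == sum(i * 3**i for i in range(1, 6))  # 1641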
How to solve the following equation?
I am interested in the methods of solution.
n^3 mod P = (n+1)^3 mod P
P- Prime number
Short example with the answer.
Could you give a step-by-step solution for my example?
n^3 mod 61 = (n + 1)^3 mod 61
Integer solutions:
n = 61*m + 4,
n = 61*m + 56,
where m ∈ Z, the set of integers.
Another way to state n^3 ≡ (n+1)^3 is n^3 ≡ n^3 + 3n^2 + 3n + 1 (just expand the cube of n+1); the cubic terms cancel out to give the nicer quadratic 3n^2 + 3n + 1 ≡ 0.
Then the usual quadratic formula applies, though all of its operations are now modulo P, and the discriminant is not always a quadratic residue, in which case there are no solutions to the original equation (this happens about half the time). This involves finding a square root modulo a prime, which is not hard for a computer to do, for example with the Tonelli–Shanks algorithm, though it is not trivial to implement.
By the way, 3n^2 + 3n + 1 ≡ 0 has the property that if n is a solution, then -n - 1 is too.
For example, with some Python, once all the support functions exist it is pretty simple:
def solve(p):
    # solve 3 n^2 + 3 n + 1 ≡ 0 (mod p); the discriminant is 9 - 12 = -3
    D = -3 % p
    sqrtD = modular_sqrt(D, p)
    if sqrtD == 0:
        return None
    else:
        n = (sqrtD - 3) * inverse(6, p) % p
        return (n, -(n + 1) % p)
Inverse modulo a prime is really easy:

def inverse(x, p):
    # Fermat's little theorem: x^(p-2) ≡ x^(-1) (mod p) for prime p
    return pow(x, p - 2, p)
I adapted this implementation of Tonelli-Shanks to Python 3 (// instead of / for integer division):
def modular_sqrt(a, p):
    """ Find a quadratic residue (mod p) of 'a'. p
        must be an odd prime.

        Solve the congruence of the form:
            x^2 = a (mod p)
        and returns x. Note that p - x is also a root.

        0 is returned if no square root exists for
        these a and p.

        The Tonelli-Shanks algorithm is used (except
        for some simple cases in which the solution
        is known from an identity). This algorithm
        runs in polynomial time (unless the
        generalized Riemann hypothesis is false).
    """
    # Simple cases
    #
    if legendre_symbol(a, p) != 1:
        return 0
    elif a == 0:
        return 0
    elif p == 2:
        return 0
    elif p % 4 == 3:
        return pow(a, (p + 1) // 4, p)

    # Partition p-1 to s * 2^e for an odd s (i.e.
    # reduce all the powers of 2 from p-1)
    #
    s = p - 1
    e = 0
    while s % 2 == 0:
        s //= 2
        e += 1

    # Find some 'n' with a legendre symbol n|p = -1.
    # Shouldn't take long.
    #
    n = 2
    while legendre_symbol(n, p) != -1:
        n += 1

    # Here be dragons!
    # Read the paper "Square roots from 1; 24, 51,
    # 10 to Dan Shanks" by Ezra Brown for more
    # information
    #
    # x is a guess of the square root that gets better
    # with each iteration.
    # b is the "fudge factor" - by how much we're off
    # with the guess. The invariant x^2 = ab (mod p)
    # is maintained throughout the loop.
    # g is used for successive powers of n to update
    # both a and b
    # r is the exponent - decreases with each update
    #
    x = pow(a, (s + 1) // 2, p)
    b = pow(a, s, p)
    g = pow(n, s, p)
    r = e

    while True:
        t = b
        m = 0
        for m in range(r):
            if t == 1:
                break
            t = pow(t, 2, p)

        if m == 0:
            return x

        gs = pow(g, 2 ** (r - m - 1), p)
        g = (gs * gs) % p
        x = (x * gs) % p
        b = (b * g) % p
        r = m

def legendre_symbol(a, p):
    """ Compute the Legendre symbol a|p using
        Euler's criterion. p is a prime, a is
        relatively prime to p (if p divides
        a, then a|p = 0).

        Returns 1 if a has a square root modulo
        p, -1 otherwise.
    """
    ls = pow(a, (p - 1) // 2, p)
    return -1 if ls == p - 1 else ls
You can see some results on ideone
function foo(n)
    if n = 1 then
        return 1
    else
        return foo(rand(1, n))
    end if
end function
If foo is initially called with m as the parameter, what is the expected number of times that rand() would be called?
BTW, rand(1,n) returns a uniformly distributed random integer in the range 1 to n.
A simple example is the expected number of calls to calculate foo(2). Say this is x; then x = 1 + (0/2) + (x/2), because we make one call to rand, then with probability 1/2 we move to foo(1) (which makes no further calls) and with probability 1/2 we stay at foo(2). Solving the equation gives x = 2.
As with most running time analysis of recursion, we try to get a recursive formula for the running time. We can use linearity of expectation to proceed through the random call:
E[T(1)] = 0
E[T(2)] = 1 + (E[T(1)] + E[T(2)])/2 = 2
E[T(n)] = 1 + (E[T(1)] + E[T(2)] + ... E[T(n)])/n
= 1 + (E[T(1)] + E[T(2)] + ... E[T(n-1)])/n + E[T(n)]/n
= 1 + (E[T(n-1)] - 1)(n-1)/n + E[T(n)]/n
Hence
E[T(n)](n-1) = n + (E[T(n-1)] - 1)(n-1)
And so, for n > 1:
E[T(n)] = 1/(n-1) + E[T(n-1)]
= 1/(n-1) + 1/(n-2) + ... + 1/2 + 2
= Harmonic(n-1) + 1
= O(log n)
This is also what we might intuitively have expected, since n should approximately halve at each call to foo.
We may also consider the 'worst case with high probability'. For this it's easy to use Markov's inequality, which says P[X >= a*E[X]] <= 1/a, i.e. P[X < a*E[X]] >= 1 - 1/a. Setting a = 100, we get that with 99% probability the algorithm makes fewer than roughly 100 * log n calls to rand.
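A quick Monte Carlo check (my own sketch, not part of the original answer) agrees with Harmonic(n-1) + 1:

import random

def calls(n):
    # Count how many times rand() is invoked by foo(n).
    count = 0
    while n > 1:
        count += 1
        n = random.randint(1, n)  # plays the role of rand(1, n)
    return count

n, trials = 1000, 100000
estimate = sum(calls(n) for _ in range(trials)) / trials
harmonic = sum(1.0 / i for i in range(1, n))  # Harmonic(n - 1)
print(estimate, harmonic + 1)  # the two numbers should be close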
Problem:
Find the sum of the digits of all the numbers from 1 to N (both ends included)
Time complexity should be O(log N)
For N = 10 the sum is 1+2+3+4+5+6+7+8+9+(1+0) = 46
For N = 11 the sum is 1+2+3+4+5+6+7+8+9+(1+0)+(1+1) = 48
For N = 12 the sum is 1+2+3+4+5+6+7+8+9+(1+0)+(1+1) +(1+2)= 51
This recursive solution works like a charm, but I'd like to understand the rationale for reaching such a solution. I believe it's based on finite induction, but can someone show exactly how to solve this problem?
I've pasted (with minor modifications) the aforementioned solution:
static long Solution(long n)
{
    if (n <= 0)
        return 0;
    if (n < 10)
        return (n * (n + 1)) / 2; // sum of arithmetic progression

    long x = long.Parse(n.ToString().Substring(0, 1)); // first digit
    long y = long.Parse(n.ToString().Substring(1));    // remaining digits
    long power = (long)Math.Pow(10, n.ToString().Length - 1); // long rather than int, which overflows for large n

    // how to reach this recursive solution?
    return (power * Solution(x - 1))
        + (x * (y + 1))
        + (x * Solution(power - 1))
        + Solution(y);
}
Unit test (which is NOT O(log N)):

long count = 0;
for (int i = 1; i <= N; i++)
{
    foreach (var c in i.ToString().ToCharArray())
        count += int.Parse(c.ToString());
}

Or:

Enumerable.Range(1, N).SelectMany(
    n => n.ToString().ToCharArray().Select(
        c => int.Parse(c.ToString())
    )
).Sum();
This is actually an O(n^log10(2))-time solution (log10(2) is approximately 0.3). Not sure if that matters. We write n = xy, where xy denotes the concatenation of digits x and y, not multiplication. Here are the four key lines with commentary underneath.
return (power * Solution(x - 1))
This counts the contribution of the x place for the numbers from 1 inclusive to x*power exclusive. This recursive call doesn't contribute to the complexity because it returns in constant time.
+ (x * (y + 1))
This counts the contribution of the x place for the numbers from x*power inclusive to n inclusive.
+ (x * Solution(power - 1))
This counts the contribution of the lower-order places for the numbers from 1 inclusive to x*power exclusive. This call is on a number one digit shorter than n.
+ Solution(y);
This counts the contribution of the lower-order places for the numbers from x*power inclusive to n inclusive. This call is on a number one digit shorter than n.
We get the time bound from applying Case 1 of the Master Theorem. To get the running time down to O(log n), we can compute Solution(power - 1) analytically. I don't remember offhand what the closed form is.
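For what it's worth, the closed form works out to Solution(10^d - 1) = 45 * d * 10^(d-1), since each of the d digit positions takes each value 0..9 exactly 10^(d-1) times; this is the quantity the BigInteger version further down plugs in. A quick Python check:

def digit_sum_below_power(d):
    # Total of the digits of all numbers in 0 .. 10**d - 1.
    return 45 * d * 10 ** (d - 1)

assert digit_sum_below_power(3) == sum(
    sum(int(c) for c in str(i)) for i in range(10 ** 3)
)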
After thinking for a while (and finding similar answers), I think I've worked out the rationale, which led me to a different solution.
Definitions
Let S(n) be the sum of the digits of all numbers 1 <= k <= n.
Let D(k) be the plain digit sum of k alone.
(I'll omit parentheses for clarity, so consider Dx = D(x).)
If n >= 10, let's decompose n by splitting off the last digit: n = 10*k + r (k, r being integers).
We need to sum S(n) = S(10*k + r) = S(10*k) + D(10*k+1) + ... + D(10*k+r)
The first part, S(10*k), follows a pattern:
S(10*1) = D1+D2+...+D10 = (1+2+...+9)*1             + D10
S(10*2) = D1+D2+...+D20 = (1+2+...+9)*2 + 1*9       + D10 + D20
S(10*3) = D1+D2+...+D30 = (1+2+...+9)*3 + 1*9 + 2*9 + D10 + D20 + D30
So S(10*k) = (1+2+3+...+9)*k + 9*S(k-1) + S(k-1) + D(10*k) = 45*k + 10*S(k-1) + D(10*k)
Regarding the last part, we know that D(10*k+x) = D(10*k)+D(x) = D(k)+x, so this last part can be simplified:
D(10*k+1) + ... + D(10*k+r) = D(k)+1 + D(k)+2 + ... D(k)+r = rD(k) + (1+2+...+r) = rD(k) + r*(1+r)/2
So, adding both parts of the equation (and grouping D(k)) we have:
S(n) = 45*k + 10*S(k-1) + (1+r)D(k) + r*(1+r)/2
And replacing k and r we have:
S(n) = 45*(n/10) + 10*S((n/10)-1) + (1+n%10)*D(n/10) + (n%10)*(1+n%10)/2
Pseudocode:
S(n):
    if n = 0, return 0
    if n < 10, return n*(1+n)/2
    r = n%10   # decompose n = 10*k + r (k, r being integers)
    k = n/10
    return 45*k + 10*S(k-1) + (1+r)*D(k) + r*(1+r)/2

D(n):
    just sum the digits
First algorithm (the one from the original question) in C#
static BigInteger Solution(BigInteger n)
{
    if (n <= 0)
        return 0;
    if (n < 10)
        return (n * (n + 1)) / 2; // sum of arithmetic progression

    long x = long.Parse(n.ToString().Substring(0, 1)); // first digit
    long y = long.Parse(n.ToString().Substring(1));    // remaining digits
    BigInteger power = BigInteger.Pow(10, n.ToString().Length - 1);
    var log = Math.Round(BigInteger.Log10(power)); // BigInteger.Log10 can give rounding errors like 2.99999

    // Contribution of the x place for the numbers from 1 inclusive to x*power
    // exclusive; this recursive call returns in constant time.
    return (power * Solution(x - 1))
        // Contribution of the x place for the numbers from x*power inclusive
        // to n inclusive.
        + (x * (y + 1))
        // Contribution of the lower-order places for the numbers from 1
        // inclusive to x*power exclusive. The recursive call
        //   + (x * Solution(power - 1))
        // is replaced by its closed form:
        + (x * 45 * new BigInteger(log) * BigInteger.Pow(10, (int)log - 1))
        // Contribution of the lower-order places for the numbers from x*power
        // inclusive to n inclusive (a number one digit shorter than n).
        + Solution(y);
}
Second algorithm (deduced from formula above) in C#
static BigInteger Solution2(BigInteger n)
{
    if (n <= 0)
        return 0;
    if (n < 10)
        return (n * (n + 1)) / 2; // sum of arithmetic progression

    BigInteger r = BigInteger.ModPow(n, 1, 10); // decompose n = 10*k + r
    BigInteger k = BigInteger.Divide(n, 10);

    return 45 * k
        + 10 * Solution2(k - 1) // 10*S((n/10)-1)
        + (1 + r) * (k.ToString().ToCharArray().Select(x => int.Parse(x.ToString())).Sum()) // (1+n%10)*D(n/10)
        + (r * (r + 1)) / 2; // n%10*(1+n%10)/2
}
EDIT: According to my tests, it runs faster than both the original version (which used recursion twice) and the version modified to compute Solution(power - 1) in a single step.
PS: I'm not sure, but I suspect that if I had split off the first digit of the number instead of the last, I might have arrived at a solution like the original algorithm.
Question from a face-to-face interview at MS:
Determine the number of integral solutions of
x1 + x2 + x3 + x4 + x5 = N
where 0 <= xi <= N
So basically we need to count the solutions as ordered tuples (x1, ..., x5) of nonnegative integers summing to N.
It is supposed to be solved with paper and pencil. I did not make much headway though; does anybody have a solution for this?
Assume first that all numbers are strictly > 0.
Consider an integer segment [0, N]. The problem is to split it into 5 segments of positive length. Imagine we do that by putting 4 splitter dots between adjacent numbers. How many ways are there to do that? C(N-1, 4).
Now, some numbers can be 0s. Let k be the number of non-zero numbers. We can choose which ones they are in C(5,k) ways, each choice having C(N-1, k-1) splittings. Accumulating over all k in the [0,5] range, we get
Sum[ C(5,k) * C(N-1, k-1); k = 0 to 5 ]
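A quick Python check of that sum against the direct stars-and-bars count C(N+4, 4) (my own sketch, not part of the original answer; the k = 0 term vanishes for N >= 1):

from math import comb

def count_by_cases(N):
    # k of the five variables are strictly positive: C(5, k) choices,
    # times C(N-1, k-1) ways to split N into k positive parts.
    return sum(comb(5, k) * comb(N - 1, k - 1) for k in range(1, 6))

assert all(count_by_cases(N) == comb(N + 4, 4) for N in range(1, 100))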
@Grigor Gevorgyan indeed gives the right way to figure out the solution.
Think about the case
1 <= xi
That is exactly dividing N points into 5 segments, which is equivalent to inserting 4 "splitter dots" into the N-1 possible places between adjacent numbers. So the answer is C(N-1, 4).
Then what about the case
0 <= xi
?
If you have the solution for X+5 points with
1 <= xi
whose answer is C(N-1, 4) = C(X+5-1, 4) = C(X+4, 4),
then you simply remove one point from each set, and you have a solution for X points with
0 <= xi
which means the answer now is exactly C(X+4, 4).
Topcoder tutorials
Look for the section "Combination with repetition": the specific case is explained under that section with a diagrammatic illustration. (A picture is worth quite a few words!)
You have the answer here.
It is a classical problem:
Number of ways to put N balls in M boxes = C(M+N-1, N).
Here M = 5, so the count is C(N+4, N) = C(N+4, 4).
The combinations solution is more appropriate if a pen and paper solution was asked. It's also the classic solution. Here is a dynamic programming solution.
Let dp[i, N] = number of solutions of x1 + x2 + ... +xi = N.
Let's take x1 + x2 = N:
We have the solutions:
0 + N = N
1 + N - 1 = N
...
N + 0 = N
So dp[2, N] = N + 1 solutions.
Let's take x1 + x2 + x3 = N:
We have the solutions:
0 + (0 + N) = N
0 + (1 + N - 1) = N
...
0 + (N + 0) = N
...
Notice that there are N + 1 solutions thus far. Moving on:
1 + (0 + N - 1) = N
1 + (1 + N - 2) = N
...
1 + (N - 1 + 0) = N
...
Notice that there are another N solutions. Moving on:
...
N - 1 + (0 + 1) = N
N - 1 + (1 + 0) = N
=> +2 solutions
N + (0 + 0) = N
=> +1 solution
So we have dp[3, N] = dp[2, N] + dp[2, N - 1] + dp[2, N - 2] + ... + dp[2, 0].
Also notice that dp[k, 0] = 1
Since each row of the table takes O(N) operations with the optimization below, computing dp[k, N] costs O(k*N), which is comparable to evaluating the combinatorics solution naively (e.g. by building Pascal's triangle).
To keep the complexity of each row O(N), store s[i] = the sum of the first i elements of the previous row (a running prefix sum). The memory used can also be reduced to O(N), since only the previous row is ever needed.
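Here is a short Python sketch of that DP with the prefix-sum trick (my own illustration; count_dp is a hypothetical name):

from math import comb

def count_dp(m, n):
    # dp[j] = number of solutions of x1 + ... + xi = j for the current i.
    # Row i is the prefix sum of row i-1, because
    # dp[i][j] = dp[i-1][0] + dp[i-1][1] + ... + dp[i-1][j].
    dp = [1] * (n + 1)          # i = 1: exactly one solution for every j
    for _ in range(m - 1):      # rows i = 2 .. m
        for j in range(1, n + 1):
            dp[j] += dp[j - 1]  # in-place running prefix sum: O(N) per row
    return dp[n]

assert count_dp(5, 10) == comb(10 + 4, 4)  # both give 1001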