f[0] = 0;
f[1] = 1;
f[x_] := f[x-1] + f[x-2]
This function is running slow in Mathematica and I need to increase the speed. I have to use functional programming and recursion. I'm not sure why this is running so slow, and even the slightest idea how to improve this would be helpful.

A good way to write a faster recursive function is to have it memorize previous values. This does come at the cost of memory, of course, but it can help in cases like this. In order to calculate f[x], you calculate f[x-1] and f[x-2] - and then to calculate f[x-1], you calculate f[x-2] again; you end up recalculating a lot of values a lot of times. (Forgive my imprecision!)
To store things as you go, you can use this idiom:
f[x_] := ( f[x] = (* calculation of f[x] goes here *) )
Edit: I don't have Mathematica on this machine, but I'm pretty sure there's no way this computes the wrong value.
f[0] = 0;
f[1] = 1;
f[x_] := ( f[x] = f[x-1] + f[x-2] );
Like I said in a comment below, if you have other definitions of f, you might want to wipe them out first with Clear[f].
Thanks to rcollyer: Be careful about $RecursionLimit! It defaults to 256. (Of course, this is with good reason. Really deep recursion is usually a bad idea.)
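For readers who want to try the same idea outside Mathematica, here is a minimal Python sketch of memoization. It is an analogue of the `f[x_] := (f[x] = ...)` idiom, not a translation of it; the name `fib` and the use of `functools.lru_cache` as the cache are choices made for this illustration.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Fibonacci with fib(0) = 0 and fib(1) = 1.

    lru_cache stores every result, so each value is computed only once.
    """
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)


print(fib(30))  # 832040
```

As in Mathematica, the cache trades memory for time, and very deep recursion can still hit the language's recursion limit.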

Jefromi is right. Look at Memoization on Wikipedia. They use the example of factorial and how to speed it up with memoization.

Memoization is a good way to write a faster recursive function. However, in this case there is a recursive alternative that runs tremendously faster than the original function, without requiring memoization.
The key observation is to see that the original definition performs a lot of redundant calculations. Consider what happens if we calculate fib[4]:
fib[4] = fib[3] + fib[2]
  fib[3] = fib[2] + fib[1]
    fib[2] = fib[1] + fib[0]
      fib[1] = 1
      fib[0] = 0
    ∴ fib[2] = 1 + 0 = 1
    fib[1] = 1
  ∴ fib[3] = 1 + 1 = 2
  fib[2] = fib[1] + fib[0]
    fib[1] = 1
    fib[0] = 0
  ∴ fib[2] = 1 + 0 = 1
∴ fib[4] = 2 + 1 = 3
In this process, fib[2] and fib[0] were computed twice each and fib[1] was computed thrice. For larger computations, the waste grows dramatically -- exponentially in fact.
If one were to calculate the same Fibonacci number by hand, one might proceed something like this:
0: given 0
1: given 1
2: 0 + 1 = 1
3: 1 + 1 = 2
4: 1 + 2 = 3
There are no redundant calculations. At any given point, one only needs to consider the previous two results. This latter approach can be expressed recursively thus:
fib2[0] = 0;
fib2[n_] :=
  Module[{f},
    f[n, p1_, _] := p1;
    f[x_, p1_, p2_] := f[x + 1, p1 + p2, p1];
    f[1, 1, 0]
  ]
Block[{$IterationLimit = Infinity}, fib2[100000]]
No doubt, this form is not as easy to read as the original. On the other hand, the original function took 35 seconds to compute fib[35] on my machine, whereas the revised function's runtime for the same computation was reported as zero. Furthermore, the revised function computes fib2[100000] in 0.281 seconds, without requiring any of the extra storage of memoization. fib[100000] is quite out of reach of the original function, and the memoized version crashed my Mathematica 7.01 kernel -- too many memoized rules perhaps?
Note that Mathematica, by default, will iterate a function no more than 4096 times. To raise that limit, you must assign a higher value to $IterationLimit as illustrated in the example above.
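The accumulator idea behind fib2 can be sketched in Python as well. Python does not eliminate tail calls, so this sketch writes the tail call as an explicit loop; the variable names `x`, `p1` and `p2` mirror the arguments of the inner `f` above, and the function name `fib2` is reused only for illustration.

```python
def fib2(n: int) -> int:
    """Fibonacci with fib2(0) = 0, carrying only the previous two results."""
    if n == 0:
        return 0
    # Same starting state as the call f[1, 1, 0]:
    # x is the current index, p1 = F(x), p2 = F(x - 1).
    x, p1, p2 = 1, 1, 0
    while x != n:
        # Same step as f[x + 1, p1 + p2, p1].
        x, p1, p2 = x + 1, p1 + p2, p1
    return p1
```

Each step does one addition and keeps two numbers, so there is no redundant work and no table of stored results.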
Of course, in Mathematica there are plenty of non-recursive ways to calculate Fibonacci numbers, up to and including the built-in Fibonacci function. But that is not the point of this exercise.
Tail Call Optimization?
It is always desirable to express recursive functions using tail calls. This permits the recursion to be executed by simple iteration, without the overhead of retaining intermediate results on the stack. fib2 is tail recursive. Some languages, like Scheme, mandate tail call optimization. Other languages, like Java, could support it but don't (or won't, as in the case of Python).
In the case of Mathematica, it is not clear to what extent tail call optimization is performed. For further discussion of this point, see another SO question.


Why does the same function written in different ways have such different running times?

I've been playing with the Wolfram Language and noticed something: the same function written in different ways performs very differently in terms of time.
Consider these two functions:
NthFibonacci[num_] :=
 If[num == 0 || num == 1, Return[1],
  Return[NthFibonacci[num - 1] + NthFibonacci[num - 2]]
 ]
Fibn[num_] := {
  a = 1;
  b = 1;
  For[i = 0, i < num - 1, i++,
   c = a + b;
   a = b;
   b = c;
   ];
  Return[b];
  }
NthFibonacci[30] takes around 5 seconds to evaluate.
Fibn[900 000] also takes around 5 seconds to evaluate.
So does the built-in Fibonacci[50 000 000]
I simply can't understand why there are such differences in speed between the three. In theory, recursion should be more or less equivalent to a for loop. What is causing this?
It's because the recursive version you present does lots and lots of repeated calculations. Build a tree of the function calls to see what I mean. Even for an argument as small as 4, look at how many function calls are generated to get to a base case down each chain of the logic.
                f(4)
               /    \
           f(3)      f(2)
          /    \    /    \
       f(2)   f(1) f(1)  f(0)
      /    \
   f(1)   f(0)
With your recursion, the number of function calls grows exponentially with the argument num.
By contrast, your looped version grows linearly in num. It doesn't take a very large value of n before n is a lot less work than 2^n.
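To put numbers on that growth, here is a small Python sketch that counts how many calls the doubly recursive definition makes. The helper names `count_calls` and `loop_steps` are hypothetical; the base cases (0 and 1) and the loop bound (num - 1 iterations) follow the two functions in the question.

```python
def count_calls(n: int) -> int:
    """Number of function calls the doubly recursive definition makes for n."""
    if n < 2:
        return 1  # a base case is a single call
    # One call for n itself, plus everything below it.
    return 1 + count_calls(n - 1) + count_calls(n - 2)


def loop_steps(n: int) -> int:
    """Number of iterations the looped version performs for n."""
    return max(n - 1, 0)


for n in (5, 10, 20):
    print(n, count_calls(n), loop_steps(n))
```

For an argument of 4 the recursive version already makes 9 calls, matching the tree above, while the loop needs 3 iterations.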
There are many ways to implement recursion; the Fibonacci function is a lovely example. As pjs already pointed out, the classic, double-recursive definition grows exponentially. The base is
φ = (sqrt(5)+1) / 2 = 1.618+
Your NthFibonacci implementation works this way. It's order φ^n, meaning that for large n, calling f(n+1) takes φ times as long as f(n).
The gentler approach computes each functional value only once in the stream of execution. Instead of exponential time, it takes linear time, meaning that calling f(2n) takes 2 times as long as f(n).
There are other approaches. For instance, Dynamic Programming (DP) keeps a cache of previous results. In pjs's f(4) case, a DP implementation would compute f(2) only once; the second call would see that the result of the first was in the cache, and return it rather than making further calls to f(0) and f(1). This tends toward linear time.
There are also implementations that make checkpoints, such as caching f(k) and f(k+1) for k divisible by 1000. These save time by having a starting point not too far below the desired value, giving them an upper bound of 998 iterations to find the needed value.
Ultimately, the fastest implementations use the direct computation (at least for larger numbers) and work in constant time.
φ = (1+sqrt(5)) / 2 = 1.618...
ψ = (1-sqrt(5)) / 2 = -.618...
f(n) = (φ^n - ψ^n) / sqrt(5)
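A minimal Python sketch of the direct computation follows. It assumes the convention f(0) = 0, f(1) = 1 that the closed form above implies (the question's NthFibonacci is shifted by one), and it uses ordinary double-precision floats, so it should only be trusted for modest n; the names `fib_closed_form` and `fib_loop` are made up for this example.

```python
import math

SQRT5 = math.sqrt(5.0)
PHI = (1.0 + SQRT5) / 2.0
PSI = (1.0 - SQRT5) / 2.0


def fib_closed_form(n: int) -> int:
    """Closed form f(n) = (phi^n - psi^n) / sqrt(5), rounded to the nearest integer."""
    return round((PHI ** n - PSI ** n) / SQRT5)


def fib_loop(n: int) -> int:
    """Exact reference value computed with integer additions."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# Compare the two for small arguments only; floating-point rounding
# eventually makes the closed form drift from the exact value.
print(all(fib_closed_form(n) == fib_loop(n) for n in range(41)))
```

For large n the floating-point result loses digits, which is why exact implementations fall back on integer arithmetic.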
The issue noted by @pjs can be addressed to a degree by having the recursive function remember prior values. (Eliminating the If helps too.)
NthFibonacci[0] = 1
NthFibonacci[1] = 1
NthFibonacci[num_] :=
NthFibonacci[num] = NthFibonacci[num - 1] + NthFibonacci[num - 2]
NthFibonacci[300] // AbsoluteTiming
{0.00201479, 3.59 10^62}
Cleaning up your loop version as well (you should almost never use Return in Mathematica):
Fibn[num_] := Module[{a = 1, b = 1,c},
Do[c = a + b; a = b; b = c, {num - 1}]; b]
Fibn[300] // AbsoluteTiming
{0.000522175 ,3.59 10^62}
You see the recursive form is slower, but not horribly so. (Note that the recursive form also hits a depth limit around 1000.)

Fortran multidimensional sub-array performance

While manipulating and assigning sub-arrays within multidimensional arrays in Fortran90, I stumbled across an interesting performance quirk.
Fortran90 introduced the ability to manipulate sub-sections of arrays, and I have seen a few places which recommend that array operations be performed using this "slicing" method instead of loops. For instance, if I have to add two arrays, a and b of size 10, it is better to write:
c(1:10) = a(1:10) + b(1:10)
or
c = a + b
Instead of
do i = 1, 10
c(i) = a(i) + b(i)
end do
I tried this method for simple one dimensional and two dimensional arrays and found it to be faster with the "slicing" notation. However, things began to get a little interesting when assigning such results within multidimensional arrays.
First of all, I must apologize for my rather crude performance measuring exercise. I am not even sure if the method I have adopted is the right way to time and test codes, but I am fairly confident about the qualitative results of the test.
program main
    implicit none
    integer, parameter :: mSize = 10000
    integer :: i, j
    integer :: pCnt, nCnt, cntRt, cntMx
    integer, dimension(mSize, mSize) :: a, b
    integer, dimension(mSize, mSize, 3) :: c
    pCnt = 0
    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "First call: ", nCnt-pCnt
    pCnt = nCnt
    do j = 1, mSize
        do i = 1, mSize
            a(i, j) = i*j
            b(i, j) = i+j
        end do
    end do
    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "Created Matrices: ", nCnt-pCnt
    pCnt = nCnt
    !c(1:mSize, 1:mSize, 1) = a + b
    !c(1:mSize, 1:mSize, 2) = a - b
    !c(1:mSize, 1:mSize, 3) = a * b
    do j = 1, mSize
        do i = 1, mSize
            c(i, j, 1) = a(i, j) + b(i, j)
            c(i, j, 2) = a(i, j) - b(i, j)
            c(i, j, 3) = a(i, j) * b(i, j)
        end do
    end do
    call SYSTEM_CLOCK(nCnt, cntRt, cntMx)
    print *, "Added Matrices: ", nCnt-pCnt
    pCnt = nCnt
end program main
As can be seen, I have two methods of operating upon and assigning two large 2D arrays into a 3D array. I was heavily in favour of using the slicing notation as it helped me write shorter and more elegant looking code. But upon observing how severely sluggish my code was, I was forced to recheck the capacity of slicing notation over calculating within loops.
I ran the above code with and without -O3 flag using GNU Fortran 4.8.4 for Ubuntu 14.04
Without -O3 flag
a. Slicing notation
5 Runs - 843, 842, 842, 841, 859
Average - 845.4
b. Looped calculation
5 Runs - 1713, 1713, 1723, 1711, 1713
Average - 1714.6
With -O3 flag
a. Slicing notation
5 Runs - 545, 545, 544, 544, 548
Average - 545.2
b. Looped calculation
5 Runs - 479, 477, 475, 472, 472
Average - 475
I found it very interesting that without -O3 flag, the slicing notation continued to perform way better than loops. However, using -O3 flag causes this advantage to vanish completely. Contrarily, it becomes detrimental to use array slicing notation in this case.
In fact, with my rather large 3D parallel computation code, this is turning out to be a significant bottle-neck. I strongly suspect that the formation of array temporaries during the assignment of a lower dimensional array to a higher dimensional array is the culprit here. But why did the optimization flag fail to optimize the assignment in this case?
Moreover, I feel that blaming -O3 flag is not a respectable thing to do. So are array temporaries really the culprit? Is there something else I may be missing? Any insight will be extremely helpful in speeding up my code. Thanks!
When doing any performance comparison, you have to compare apples with apples and oranges with oranges. What I mean is that you are not really comparing the same thing. The two versions are totally different even if they produce the same result.
What comes into play here is the memory management, think of cache faults during the operation. If you turn the loop version into 3 different loops as suggested by haraldkl you will certainly get similar performance.
What happens is that when you combine the 3 assignments in the same loop, there is a lot of cache reuse for the right-hand side, since all 3 share the same variables there. Each element of a or b is loaded into the cache and into registers only once in the loop version, while in the array operation version each element of a or b gets loaded 3 times. That is what makes the difference. The larger the arrays, the larger the difference, because you will get more cache faults and more reloading of elements into the registers.
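The bookkeeping behind that argument can be sketched in Python. This only counts logical reads of the input elements; it does not model a real cache or what gfortran actually emits, and the function names `fused` and `separate` are hypothetical.

```python
import operator


def fused(a, b):
    """One pass: each element of a and b is read once and used three times."""
    loads = 0
    add, sub, mul = [], [], []
    for x, y in zip(a, b):
        loads += 2  # one read of a, one read of b
        add.append(x + y)
        sub.append(x - y)
        mul.append(x * y)
    return (add, sub, mul), loads


def separate(a, b):
    """Three passes, one per operation: every element is read again each time."""
    loads = 0
    results = []
    for op in (operator.add, operator.sub, operator.mul):
        out = []
        for x, y in zip(a, b):
            loads += 2
            out.append(op(x, y))
        results.append(out)
    return tuple(results), loads


a = list(range(1, 1001))
b = list(range(1001, 2001))
print(fused(a, b)[1], separate(a, b)[1])  # 2000 6000
```

Both variants produce identical results; the three-pass form simply touches the inputs three times as often, which is where large arrays that do not fit in cache start to hurt.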
I don't know what the compiler really does so not really an answer, but too much text for a comment...
I'd have the suspicion that the compiler expands the array notation into something like this:
do j = 1, mSize
    do i = 1, mSize
        c(i, j, 1) = a(i, j) + b(i, j)
    end do
end do
do j = 1, mSize
    do i = 1, mSize
        c(i, j, 2) = a(i, j) - b(i, j)
    end do
end do
do j = 1, mSize
    do i = 1, mSize
        c(i, j, 3) = a(i, j) * b(i, j)
    end do
end do
Of course, the compiler might still collapse these loops if written like that, so you might need to confuse it a little more, for example by writing some element of c to the screen between the loops.

Dynamic programming assembly line scheduling

I am reading about dynamic programming in the algorithms book by Cormen et al. The following text is from the book:
Suppose we have a motor car factory with two assembly lines, called line 1 and line 2. We have to determine the fastest time to get a chassis all the way through.
The ultimate goal is to determine the fastest time to get a chassis all the way through the factory, which we denote by Fn. The chassis has to get all the way through station "n" on either line 1 or line 2 and then to the factory exit. Since the faster of these ways is the fastest way through the entire factory, we have
Fn = min(f1[n] + x1, f2[n]+x2) ---------------- Eq1
Above, x1 and x2 are the final additional times for coming out of line 1 and line 2.
I have following recurrence equations. Consider following are Eq2.
f1[j] = e1 + a_{1,1}                                              if j = 1
f1[j] = min(f1[j-1] + a_{1,j}, f2[j-1] + t_{2,j-1} + a_{1,j})     if j >= 2
f2[j] = e2 + a_{2,1}                                              if j = 1
f2[j] = min(f2[j-1] + a_{2,j}, f1[j-1] + t_{1,j-1} + a_{2,j})     if j >= 2
Let Ri(j) be the number of references made to fi[j] in a recursive algorithm.
From Eq1, R1(n) = R2(n) = 1
From Eq2 above we have
R1(j) = R2(j) = R1(j+1) + R2(j+1) for j = 1, 2, ..., n-1
My question is how the author came up with R(n) = 1, because usually we have the base case at 0 rather than at n. How would we then write the recursive functions in code, for example in C?
Another question is how author came up with R1(j) and R2(j)?
Thanks for all the help.
If you solve the problem in a recursive way, what would you do?
You'd start by calculating F(n). F(n) would call f1(n) and f2(n), which recursively call f1(n-1) and f2(n-1), and so on until reaching the leaves (f1(1), f2(1)), right?
So that's the reason the number of references to f1(n) and f2(n) in the recursive solution is 1: F(n) needs each of them only once. This is not true of f1(n-1), which is referenced both when you compute f1(n) and when you compute f2(n).
Now, how did he come up with R1(j) = R2(j) = R1(j+1) + R2(j+1)?
Well, computing it in a recursive way, every time you need f1(i), you have to compute f1(j) and f2(j) for every j in the interval [1, i), that is, for every j smaller than i.
In other words, the value of f1,2(i) depends on the values of f1,2(1..i-1), so every time you compute an f_(i), you're computing EVERY f1,2(1..i-1), because it depends on their values.
For this reason, the number of times you compute f_(i) depends on how many f1,2 there are "above it". Each evaluation of f1(j+1) and each evaluation of f2(j+1) references f1(j) exactly once (and likewise f2(j)), which gives R1(j) = R2(j) = R1(j+1) + R2(j+1).
Hope that's clear.
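The reference counts can also be checked by running the recurrence directly. The question asks about C; the sketch below is in Python for brevity, but the structure carries over: the recursion starts at station n and works down to the base case j = 1. The function name `fastest_time`, the 0-based line indices and the sample numbers are all illustrative choices, not taken from the question.

```python
from collections import Counter


def fastest_time(a, t, e, x):
    """Plain recursion on Eq1 and Eq2, counting references to each f_i[j].

    a[i][j-1] is the time at station j of line i, t[i][j-1] the transfer time
    after station j of line i, e[i] and x[i] the entry and exit times.
    Lines are indexed 0 and 1 here; stations are 1-based as in the text.
    """
    n = len(a[0])
    refs = Counter()

    def f(i, j):
        refs[(i, j)] += 1
        if j == 1:
            return e[i] + a[i][0]
        other = 1 - i
        stay = f(i, j - 1) + a[i][j - 1]
        switch = f(other, j - 1) + t[other][j - 2] + a[i][j - 1]
        return min(stay, switch)

    total = min(f(0, n) + x[0], f(1, n) + x[1])
    return total, refs


# Illustrative data for a factory with six stations per line.
a = [[7, 9, 3, 4, 8, 4], [8, 5, 6, 4, 5, 7]]
t = [[2, 3, 1, 3, 4], [2, 1, 2, 2, 1]]
e = [2, 4]
x = [3, 2]

total, refs = fastest_time(a, t, e, x)
print(total)
print([refs[(0, j)] for j in range(6, 0, -1)])
```

The counts double at every step down from station n, so R_i(j) = 2^(n-j), which is why the plain recursion is exponential and the book fills in a table from j = 1 upward instead.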

Algorithm for multidimensional optimization / root-finding / something

I have five values, A, B, C, D and E.
Given the constraint A + B + C + D + E = 1, and five functions F(A), F(B), F(C), F(D), F(E), I need to solve for A through E such that F(A) = F(B) = F(C) = F(D) = F(E).
What's the best algorithm/approach to use for this? I don't care if I have to write it myself, I would just like to know where to look.
EDIT: These are nonlinear functions. Beyond that, they can't be characterized. Some of them may eventually be interpolated from a table of data.
There is no general answer to this question. A solver that finds the solution to any equation does not exist. As Lance Roberts already says, you have to know more about the functions. Just a few examples:
If the functions are twice differentiable, and you can compute the first derivative, you might try a variant of Newton-Raphson
Have a look at the Lagrange Multiplier Method for implementing the constraint.
If the function F is continuous (which it probably is, if it is an interpolant), you could also try the Bisection Method, which is a lot like binary search.
Before you can solve the problem, you really need to know more about the function you're studying.
As others have already posted, we do need some more information on the functions. However, given that, we can still try to solve the following relaxation with a standard non-linear programming toolbox.
min k
subject to
A + B + C + D + E = 1
F1(A) - k = 0
F2(B) - k = 0
F3(C) - k = 0
F4(D) - k = 0
F5(E) - k = 0
Now we can solve this in any manner we wish, such as penalty method
min k + mu*sum((Fi(x_i) - k)^2)
subject to A + B + C + D + E = 1
or a straightforward SQP or interior-point method.
More details and I can help advise as to a good method.
The functions are all monotonically increasing with their argument. Beyond that, they can't be characterized. The approach that worked turned out to be:
1) Start with A = B = C = D = E = 1/5
2) Compute F1(A) through F5(E), then recalculate A through E so that each function equals the average of those five values (their sum divided by 5).
3) Rescale the new A through E so that they all sum to 1, and recompute F1 through F5.
4) Repeat until satisfied.
It converges surprisingly fast - just a few iterations. Of course, each iteration requires 5 root finds for step 2.
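The four steps above can be sketched in Python as follows. The five functions in `FUNCS` are invented stand-ins for F1 through F5 (all monotonically increasing), the root find is a plain bisection on an assumed bracket [0, 2], and the names `invert` and `equalize` are hypothetical; convergence is observed for this example, not guaranteed in general.

```python
import math


def invert(f, target, lo=0.0, hi=2.0, steps=100):
    """Bisection for increasing f: find x in [lo, hi] with f(x) close to target.

    The bracket [lo, hi] is an assumption; it must contain the root.
    """
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def equalize(funcs, iterations=20):
    """Iterate: average the function values, invert each function, rescale."""
    xs = [1.0 / len(funcs)] * len(funcs)              # step 1
    for _ in range(iterations):                       # step 4
        avg = sum(f(x) for f, x in zip(funcs, xs)) / len(funcs)
        xs = [invert(f, avg) for f in funcs]          # step 2: one root find each
        total = sum(xs)
        xs = [x / total for x in xs]                  # step 3: rescale to sum 1
    return xs


# Made-up monotonically increasing test functions.
FUNCS = [
    lambda x: x,
    lambda x: 2.0 * x,
    lambda x: x ** 3 + x,
    lambda x: math.exp(x) - 1.0,
    lambda x: 3.0 * x,
]

xs = equalize(FUNCS)
values = [f(x) for f, x in zip(FUNCS, xs)]
print(xs)
print(max(values) - min(values))
```

A fixed iteration count stands in for "repeat until satisfied"; in practice one would stop once the spread of the function values falls below a tolerance.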
One solution of the equations
A + B + C + D + E = 1
F(A) = F(B) = F(C) = F(D) = F(E)
is to take A, B, C, D and E all equal to 1/5. Not sure though whether that is what you want ...
Added after John's comment (thanks!)
Assuming the second equation should read F1(A) = F2(B) = F3(C) = F4(D) = F5(E), I'd use the Newton-Raphson method (see Martijn's answer). You can eliminate one variable by setting E = 1 - A - B - C - D. At every step of the iteration you need to solve a 4x4 system. The biggest problem is probably where to start the iteration. One possibility is to start at a random point, do some iterations, and if you're not getting anywhere, pick another random point and start again.
Keep in mind that if you really don't know anything about the function then there need not be a solution.
ALGENCAN (part of TANGO) is really nice. There are Python bindings, too. - " general nonlinear programming that does not use matrix manipulations at all and, so, is able to solve extremely large problems with moderate computer time. The general algorithm is of Augmented Lagrangian type ... "
Google OPTIF9 or ALLUNC. We use these for general optimization.
You could use standard search techniques, as the others mentioned. There are a few optimizations you could make use of while doing the search.
First of all, you only need to solve for A, B, C and D, because 1-E = A+B+C+D.
Second, since F(A) = F(B) = F(C) = F(D), you can search for A. Once you have F(A), you can solve for B, C and D, if that is possible. If it is not possible to solve the functions, you need to continue searching over each variable, but now you have a limited range to search because A+B+C+D <= 1.
If your search is discrete and finite, the above optimizations should work reasonably well.
I would try Particle Swarm Optimization first. It is very easy to implement and tweak. See the Wiki page for it.

Performance of swapping two elements in MATLAB

Purely as an experiment, I'm writing sort functions in MATLAB then running these through the MATLAB profiler. The aspect I find most perplexing is to do with swapping elements.
I've found that the "official" way of swapping two elements in a matrix
self.Data([i1, i2]) = self.Data([i2, i1])
runs much slower than doing it in four lines of code:
e1 = self.Data(i1);
e2 = self.Data(i2);
self.Data(i1) = e2;
self.Data(i2) = e1;
The total length of time taken up by the second example is 12 times less than the single line of code in the first example.
Would somebody have an explanation as to why?
Based on suggestions posted, I've run some more tests.
It appears the performance hit comes when the same matrix is referenced in both the LHS and RHS of the assignment.
My theory is that MATLAB uses an internal reference-counting / copy-on-write mechanism, and this is causing the entire matrix to be copied internally when it's referenced on both sides. (This is a guess because I don't know the MATLAB internals).
Here are the results from calling the function 885548 times. (The difference here is times four, not times twelve as I originally posted. Each of the functions have the additional function-wrapping overhead, while in my initial post I just summed up the individual lines).
swap1: 12.547 s
swap2: 14.301 s
swap3: 51.739 s
Here's the code:
methods (Access = public)
    function swap(self, i1, i2)
        swap1(self, i1, i2);
        swap2(self, i1, i2);
        swap3(self, i1, i2);
        self.SwapCount = self.SwapCount + 1;
    end
end
methods (Access = private)
    % swap1: stores values in temporary doubles
    % This has the best performance
    function swap1(self, i1, i2)
        e1 = self.Data(i1);
        e2 = self.Data(i2);
        self.Data(i1) = e2;
        self.Data(i2) = e1;
    end
    % swap2: stores values in a temporary matrix
    % Marginally slower than swap1
    function swap2(self, i1, i2)
        m = self.Data([i1, i2]);
        self.Data([i2, i1]) = m;
    end
    % swap3: does not use variables for storage.
    % This has the worst performance
    function swap3(self, i1, i2)
        self.Data([i1, i2]) = self.Data([i2, i1]);
    end
end
In the first (slow) approach, the RHS value is a matrix, so I think MATLAB incurs a performance penalty in creating a new matrix to store the two elements. The second (fast) approach avoids this by working directly with the elements.
Check out the "Techniques for Improving Performance" article on MathWorks for ways to improve your MATLAB code.
you could also do:
tmp = self.Data(i1);
self.Data(i1) = self.Data(i2);
self.Data(i2) = tmp;
Zach is potentially right in that a temporary copy of the matrix may be made to perform the first operation, although I would hazard a guess that there is some internal optimization within MATLAB that attempts to avoid this. It may be a function of the version of MATLAB you are using. I tried both of your cases in version (a couple years old) and only saw a speed difference of about 2-2.5.
It's possible that this may be an example of speed improvement by what's called "loop unrolling". When doing vector operations, at some level within the internal code there is likely a FOR loop which loops over the indices you are swapping. By performing the scalar operations in the second example, you are avoiding any overhead from loops. Note these two (somewhat silly) examples:
vec = [1 2 3 4];
%Example 1:
for i = 1:4,
    vec(i) = vec(i)+1;
end
%Example 2:
vec(1) = vec(1)+1;
vec(2) = vec(2)+1;
vec(3) = vec(3)+1;
vec(4) = vec(4)+1;
Admittedly, it would be much easier to simply use vector operations like:
vec = vec+1;
but the examples above are for the purpose of illustration. When I repeat each example multiple times over and time them, Example 2 is actually somewhat faster than Example 1. For a small loop with a known number (in the example, just 4), it can actually be more efficient to forgo the loop. Of course, in this particular example, the vector operation given above is actually the fastest.
I usually follow this rule: Try a few different things, and pick the fastest for your specific problem.
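That rule applies in any language. As an illustration only, here is a small Python harness that times two ways of swapping list elements with the standard `timeit` module. It says nothing about MATLAB's internals, and the function names and default sizes are made up for the example.

```python
import timeit


def swap_with_temp(data, i, j):
    """Swap two elements through an explicit temporary variable."""
    tmp = data[i]
    data[i] = data[j]
    data[j] = tmp


def swap_tuple(data, i, j):
    """Swap two elements with a single tuple assignment."""
    data[i], data[j] = data[j], data[i]


def benchmark(swap, size=1000, repeats=5, number=1000):
    """Best-of-`repeats` time in seconds for `number` swaps on a list of `size` items."""
    data = list(range(size))
    timer = timeit.Timer(lambda: swap(data, 0, size - 1))
    return min(timer.repeat(repeat=repeats, number=number))


if __name__ == "__main__":
    for candidate in (swap_with_temp, swap_tuple):
        print(candidate.__name__, benchmark(candidate))
```

Taking the minimum over several repeats reduces noise from other processes; which variant wins depends on the interpreter and version, so measure on your own setup.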
This post deserves an update, since the JIT compiler is now a thing (since R2015b) and so is timeit (since R2013b) for more reliable function timing.
Below is a short benchmarking function for element swapping within a large array.
I have used the terms "directly swapping" and "using a temporary variable" to describe the two methods in the question respectively.
The results are pretty staggering: the performance of directly swapping 2 elements is increasingly poor by comparison to using a temporary variable.
function benchie()
    % Variables for plotting, loop to increase size of the arrays
    M = 15; D = zeros(1,M); W = zeros(1,M);
    for n = 1:M;
        N = 2^n;
        % Create some random array of length N, and random indices to swap
        v = rand(N,1);
        x = randi([1, N], N, 1);
        y = randi([1, N], N, 1);
        % Time the functions
        D(n) = timeit(@()direct);
        W(n) = timeit(@()withtemp);
    end
    % Plotting
    plot(2.^(1:M), D, 2.^(1:M), W);
    legend('direct', 'with temp')
    xlabel('number of elements'); ylabel('time (s)')
    function direct()
        % Direct swapping of two elements
        for k = 1:N
            v([x(k) y(k)]) = v([y(k) x(k)]);
        end
    end
    function withtemp()
        % Using an intermediate temporary variable
        for k = 1:N
            tmp = v(y(k));
            v(y(k)) = v(x(k));
            v(x(k)) = tmp;
        end
    end
end
