Performance of multiple generators in Scala 'for' expressions?

An example:
for(x <- c1; y <- c2; z <- c3) yield {...}
which can be translated into:
c1.flatMap(x => c2.flatMap(y => c3.map(z => {...})))
or:
for(x <- c1) for (y <- c2) for (z <- c3) yield {...}
According to the second translation, the time complexity is O(c1 * c2 * c3). What happens if c1, c2, and c3 are very large?
I searched the internet and found plenty of code that uses far more generators than this example.
Question: in other languages we try to avoid nested for loops, so why did Scala's designers provide this mechanism? Does it have some special trick that makes the nested loops perform well?
If I have misunderstood something, please tell me; I would like to understand Scala's for-comprehensions better. Thank you.

There are two reasons why one may wish to avoid nested loops. The first is the readability and quality of the code (especially in OOP). The second, as you said, is performance, but that isn't a hard rule. It is more of an indicator that the code in this place may be slow and that it is worth checking whether the same thing can be done faster. Obviously, many problems simply require O(n^2) steps, and it doesn't matter whether you implement them using nested loops or in some other way. For the most trivial case: if you have two collections of size n and you want to print all pairs (the Cartesian product), that simply requires O(n^2) steps and that's it.
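To make that concrete, here is a small Scala sketch of that trivial case (xs and ys are made-up collections); the point is only that every pair must be visited no matter which syntax you use:

val xs = List(1, 2, 3)
val ys = List("a", "b", "c")
// Printing every pair is inherently O(n^2), whether written as a
// for-comprehension or as explicitly nested loops.
for (x <- xs; y <- ys) println((x, y))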
And remember that a for-comprehension in Scala isn't just a replacement for the for statements of other languages. It gives you a nice way of working with monads and writing more readable code. Having multiple generators inside the for doesn't necessarily mean that you have a performance problem.
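For illustration (a sketch, not part of the original answer): the same for-comprehension syntax also works over Option, where a single missing value short-circuits the whole expression. parse here is a made-up helper, not a library function.

def parse(s: String): Option[Int] =
  if (s.nonEmpty && s.forall(_.isDigit)) Some(s.toInt) else None

// Desugars to parse("1").flatMap(a => parse("2").flatMap(b => parse("x").map(c => a + b + c)))
val total: Option[Int] = for {
  a <- parse("1")
  b <- parse("2")
  c <- parse("x")   // None here makes the whole result None
} yield a + b + c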

Related

construct a structured matrix efficiently in fortran

Having left Fortran for several years, now I have to pick it up and start to work with it again.
I'd like to construct a matrix with entry(i,j) in the form f(x_i,y_j), where f is a function of two variables, e.g., f(x,y)=cos(x-y). In Matlab or Python(Numpy), there are efficient ways to handle this kind of specific issue. I wonder whether there is such optimization in Fortran.
BTW, is it also true in Fortran that a vectorized operation is faster than a do/for loop (as is the case in Matlab and Numpy) ?
If by vectorized you mean the same thing as in Matlab and Python, i.e., the short forms you call on a whole array, then no: these forms are often slower, because they may be harder to optimize than simple loops. What is faster is when the compiler actually uses the vector instructions of the CPU, but that is something else, and it is easier for the compiler to use them for simple loops.
Fortran has elemental functions, do concurrent, forall and where constructs, implied loops and array constructors. There is no point repeating them here, they have been described many times on this site or in tutorials.
Your example is most simply done using a loop
do j = 1, ny
  do i = 1, nx
    entry(i,j) = f(x(i), y(j))
  end do
end do
One of the short forms, probably what you meant by Python-like vectorization, would be whole-array operations, e.g.,
A = cos(B)
C = A * B
D = f(A*B)
and similar. The function (which is called on each element of the array) must be elemental. These operations are not necessarily efficient. For example, the last call may require a temporary array to be created, which would be avoided when using a loop.

Are Haskell List Comprehensions Inefficient?

I started doing Project Euler and got to problem 9. Since I was using Project Euler to learn Haskell, I decided to use list comprehensions (as shown in Learn You a Haskell). I did that, and GHCi took a while to figure out the triplet, which I figured was normal because of the calculations involved. Then, at work yesterday (I don't work as a programmer professionally, yet), I was talking to a friend who knows VBA, and he wanted to try to find the answer in VBA. I thought it would be a fun challenge as well, so I churned out some basic for loops and if statements, but what got me was that it was much faster than the Haskell version.
My question is: are Haskell's list comprehensions incredibly inefficient? At first I thought it was just because I was in GHC's interactive mode, but then I realized VBA is interpreted too.
Please note, I didn't post my code because it is an answer to a Project Euler problem. If posting it will answer my question (as in, I'm doing something wrong), then I will gladly post it.
[edit]
Here is my Haskell list comprehension:
[(a,b,c) | c <- [1..1000], b <- [1..c], a <- [1..b], a+b+c==1000, a^2+b^2==c^2]
I guess I could've lowered the range on c but is that what is really slowing it down?
There are two things you could be doing with this problem that could make your code slow. One is how you are trying values for a, b and c. If you loop through all possible values for a, b, c from 1 to 1000, you'll be spending a long time. To give a hint, you can make use of a+b+c=1000 if you rearrange it for c. The other is that if you only use a list comprehension, it will process every possible value for a, b and c. The problem tells you that there is only one unique set of numbers that satisfies the problem, so if you change your answer from this:
[ a * b * c | .... ]
to:
head [ a * b * c | ... ]
then Haskell's lazy evaluation means that it will stop after finding the first answer. This is the Haskell equivalent of breaking out of your VBA loop when you find the first answer. When I used both these tips, I had an answer that completed very quickly (under a second) in ghci.
Addendum: I missed at first the condition a < b < c. You can also make use of this in your list comprehensions; it is valid to say things along the lines of:
[(a, b) | b <- [1..100], a <- [1..b-1]]
Consider this simplified version of your list comprehension:
[(a,b,c) | a <- [1..1000], b <- [1..1000], c <- [1..1000]]
This will give all possible combinations of a, b, and c. It's kind of like saying, "how many ways can three one-thousand-sided dice land?" The answer is 1000*1000*1000 = 1,000,000,000 different combinations. If it took 0.001 seconds to generate each combination, it would therefore take 1,000,000 seconds (~11.5 days) to finish all combinations. (OK, 0.001 seconds is actually pretty slow for a computer, but you get the idea)
When you add predicates to your list comprehension, it still takes the same amount of time to compute; in fact, it takes longer since it needs to check the predicate for each of the 1 billion combinations it computes.
Now consider your comprehension. It looks like it should be much faster, right?
[(a,b,c) | c <- [1..1000], b <- [1..c], a <- [1..b], a+b+c==1000, a^2+b^2==c^2]
There are 1000 choices for c. How many are there for b and a? Well, the average choice for c is 500. For all choices of c, then, there are an average of 500 choices for b (since b can range from 1 to c). Likewise, for all choices of c and b, there are an average of 250 choices for a. That's very hand-wavy, but I'm fairly sure it's accurate. So 1000 choices for c * 1000/2 choices for b * 1000/4 choices for a = 1 billion / 8 = 125 million. It's 8x faster, but if you pay attention, you'll notice it's actually the same big-Oh complexity as the simplified version above. If we compared "simplified" vs "improved" versions of the same problem over [1..100000] instead of [1..1000], the "improved" one would still only be 8x faster than the "simplified" one.
Don't get me wrong, 8x is a wonderful constant-factor speedup. But unless you want to wait a couple hours to get the solution, you'll need to get a better big-Oh.
As Neil noted, the way to reduce the complexity of this problem is, for a given b and c, to choose the a that satisfies a+b+c=1000. That way, you're not trying a bunch of a values that will fail. This drops the big-Oh complexity; you'll only be considering approximately 1000 * 500 * 1 = 500,000 combinations, instead of ~125,000,000.
Once you get the solution to the problem, you can check out other people's Haskell solutions on the Project Euler site to get an idea of how others have solved it. Incidentally, here is a link to the referenced problem: http://projecteuler.net/index.php?section=problems&id=9
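For illustration only, and in Scala since that is the language of the question at the top of this page, here is a hedged sketch that combines both tips: derive a from a+b+c=1000 instead of searching for it, and stop at the first hit. euler9 is just a made-up name for the sketch.

def euler9(total: Int = 1000): Option[Int] = {
  val hits = for {
    c <- (1 to total).iterator      // iterators keep the search lazy
    b <- (1 until c).iterator
    a = total - b - c               // derived from a + b + c = total, so a is never searched
    if a > 0 && a < b && a * a + b * b == c * c
  } yield a * b * c
  if (hits.hasNext) Some(hits.next()) else None   // stop at the first solution, like Haskell's head
}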
In addition to what everyone else has said about generating fewer elements in the generators, you can also switch to using Int instead of Integer as the type of the numbers. The default is Integer, but your numbers are small enough to fit in an Int.
(Also, to nitpick, Haskell list comprehensions have no speed. Haskell is a language definition with very little operational semantics. A particular Haskell implementation might have slow list comprehensions, though.)

Iterative solving for unknowns in a fluids problem

I am a Mechanical engineer with a computer scientist question. This is an example of what the equations I'm working with are like:
x = √((y-z)×2/r)
z = f×(L/D)×(x/2g)
f = something crazy with x in it
etc…(there are more equations with x in it)
The situation is this:
I need r to find x, but I need x to find z. I also need x to find f, which is part of finding z. So I guess a value for x, then use that value to find r and f. Then I go back and use the values I found for r and f to find x. I keep doing this until the guessed value and the calculated value are the same.
My question is:
How do I get the computer to do this? I've been using mathcad, but an example in another language like C++ is fine.
The very first thing you should do when faced with an iterative algorithm is to write down on paper the sequence that will result from your idea. E.g.:
x_0 = ..., f_0 = ..., r_0 = ...
x_1 = ..., f_1 = ..., r_1 = ...
...
x_n = ..., f_n = ..., r_n = ...
Now, you have an idea of what you should implement (even if you don't know how). If you don't manage to find a closed form expression for one of the x_i, r_i or whatever_i, you will need to solve one dimensional equations numerically. This will imply more work.
Now, for the implementation part: if you have never written a program, you should seriously ask someone in person who can help you (or hire an intern and have them write the code). We cannot help you learn, e.g., C programming from scratch, but we are willing to help you with the specific problems that arise when you write the program.
Please note that your algorithm is not guaranteed to converge, even if you strongly think there is a unique solution. Solving nonlinear equations is a difficult subject.
It appears that Mathcad has many abstractions for iterative algorithms, so you may not need to implement them directly in a "lower level" language. Perhaps this question is better suited to the Mathcad forums at:
http://communities.ptc.com/index.jspa
If you are using Mathcad, it has the functionality built in. It is called solve block.
Start with the keyword "given"
Given
define the guess values for all unknowns
x:=2
f:=3
r:=2
...
define your constraints
x = √((y-z)×2/r)
z = f×(L/D)×(x/2g)
f = something crazy with x in it
etc…(there are more equations with x in it)
calculate the solution
find(x, y, z, r, ...)=
Check Mathcad help or Quicksheets for examples of the exact syntax.
The simple answer to your question is this pseudo-code:
X = startingX;
lastF = Infinity;
F = 0;
tolerance = 1e-10;
while ((lastF - F)^2 > tolerance)
{
    lastF = F;
    X = ?;
    R = ?;
    F = FunctionOf(X, R);
}
This may not do what you expect at all. It may give a valid but nonsense answer or it may loop endlessly between alternate wrong answers.
This is standard substitution to convergence. There are more advanced techniques like DIIS but I'm not sure you want to go there. I found this article while figuring out if I want to go there.
In general, it really pays to think about how you can transform your problem into an easier problem.
In my experience it is better to pose your problem as a univariate bounded root-finding problem and use Brent's method if you can.
Next worst option is multivariate minimization with something like BFGS.
Iterative solutions are horrible, but are more easily solved once you think of them as X2 = f(X1) where X is the input vector and you're trying to reduce the difference between X1 and X2.
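To make the X2 = f(X1) idea concrete, here is a minimal Scala sketch of substitution to convergence (the function and starting value are made up, not the poster's actual equations):

// Repeated substitution: keep applying f until successive values agree to a tolerance.
def fixedPoint(f: Double => Double, x0: Double,
               tol: Double = 1e-10, maxIter: Int = 1000): Option[Double] = {
  var x = x0
  var iter = 0
  while (iter < maxIter) {
    val next = f(x)
    if (math.abs(next - x) < tol) return Some(next)   // converged
    x = next
    iter += 1
  }
  None   // no convergence within maxIter; as noted above, convergence is not guaranteed
}

fixedPoint(math.cos, 1.0)   // example: solves x = cos(x), roughly Some(0.739085...)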
As the commenters have noted, the mathematical aspects of your question are beyond the scope of the help you can expect here, and are even beyond the help you could be offered based on the detail you posted.
However, I think that even if you understood the mathematics thoroughly there are computer science aspects to your question that should be addressed.
When you write your code, try to organize it into functions that depend only on the parameters you pass in. So write a subroutine that takes in values for y, z, and r and returns x. Make another that takes in f, L, D, and g and returns z. Now you have testable routines that you can check to make sure they are computing correctly. Check the input values inside your routines; for instance, in computing x you will get a divide-by-zero error if you pass in 0 for r. Think about how you want to handle this.
If you are going to solve this problem iteratively, you will need a method that decides, based on the results of one iteration, what the values for the next iteration will be. This should also be encapsulated in a subroutine. Now, if you are using a language that allows only one value to be returned from a subroutine (which is the case in the most common computation languages: C, C++, Java, C#), you need to package all your variables into some kind of data structure to return them. You could use an array of reals or doubles, but it would be nicer to make an object, so you can reference the variables by name and not by position (less chance of error).
Another aspect of iteration is knowing when to stop. Certainly you'll do so when you get a solution that converges. Make this decision into another subroutine. Now when you need to change the convergence criteria there is only one place in the code to go to. But you need to consider other reasons for stopping - what do you do if your solution starts diverging instead of converging? How many iterations will you allow the run to go before giving up?
Another aspect of iteration on a computer is round-off error. Mathematically 10^40/10^38 is 100. Mathematically 10^20 + 1 > 10^20. These statements are not true in most computations. Your calculations may need to take this into account, or you will end up with numbers that are garbage. This is an example of a cross-cutting concern that does not lend itself to encapsulation in a subroutine.
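A quick illustration of that point (in Scala, using Double arithmetic):

1e20 + 1 == 1e20                        // true: the added 1 is below Double precision at that magnitude
0.1 + 0.2 == 0.3                        // false: the sum is 0.30000000000000004
math.abs((0.1 + 0.2) - 0.3) < 1e-9      // true: convergence tests should compare against a tolerance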
I would suggest that you go look at the Python language, and the pythonxy.com extensions. There are people in the associated forums that would be a good resource for helping you learn how to do iterative solving of a system of equations.

Is Scala functional programming slower than traditional coding?

In one of my first attempts to create functional code, I ran into a performance issue.
I started with a common task - multiply the elements of two arrays and sum up the results:
var first: Array[Float] = ...
var second: Array[Float] = ...
var sum = 0f
for (ix <- 0 until first.length)
  sum += first(ix) * second(ix)
Here is how I reformed the work:
sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
When I benchmarked the two approaches, the second method takes 40 times as long to complete!
Why does the second method take so much longer? How can I reform the work to be both speed efficient and use functional programming style?
The main reasons why these two examples are so different in speed are:
the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
the faster one doesn't create temporary collections and, thus, avoids extra memory copies.
Let's consider the slower one by parts. First:
first.zip(second)
That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.
map{ case (a,b) => a*b }
Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).
We are not finished, though. Here's the next part:
reduceLeft(_+_)
That one is relatively harmless, except that it still does boxing/unboxing and java.lang.Float creation at each iteration, since reduceLeft receives a Function2, which is parameterized.
Scala 2.8 introduces a feature called specialization which will get rid of a lot of these boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:
sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:
sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
This last one doesn't actually save much, so I think the non-strictness is being "lost" pretty fast (i.e., one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (i.e., avoids some intermediary results) by default:
sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
This gives a much better result than the former. It is better than the foldLeft one, though not by much. Unfortunately, we can't combine zipped with foldLeft because the former doesn't support the latter.
The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but only for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. In my tests, though, Double actually takes longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.
So, in the end, the problem is a combination of factors, including producing intermediary copies of elements, and the way Java (JVM) handles primitives and generics. A similar code in Haskell using supercompilation would be equal to anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.
I did some variations of this with Scala 2.8. The loop version is as you wrote it, but the functional version is slightly different:
(xs, ys).zipped map (_ * _) reduceLeft(_ + _)
I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested boxed variants, which work on java.lang.Double values instead of primitive Doubles, to measure the effect of primitive-type boxing and unboxing. Here is what I got (running the Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test):
loopArray 461 437 436 437 435
reduceArray 6573 6544 6718 6828 6554
loopVector 5877 5773 5775 5791 5657
reduceVector 5064 4880 4844 4828 4926
loopArrayBoxed 2627 2551 2569 2537 2546
reduceArrayBoxed 4809 4434 4496 4434 4365
loopVectorBoxed 7577 7450 7456 7463 7432
reduceVectorBoxed 5116 4903 5006 4957 5122
The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.
If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.
If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.
So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.
This is a microbenchmark, and it depends on how the compiler optimizes your code. You have 3 loops composed here:
zip . map . fold
Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.
Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.
There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.
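As a hedged illustration of that last point (parallel collections shipped with the standard library in Scala 2.9-2.12; in 2.13 they are a separate module, so this is a sketch rather than something available when this answer was written):

// Same pipeline as the question, but the map runs on a parallel collection; only .par was added.
val sum = first.zip(second).par.map { case (a, b) => a * b }.sum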
So, in summary:
naive translations may have unexpected copies and inefficiencies
clever FP compilers remove this overhead (but Scala can't yet)
sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it
Don Stewart has a fine answer, but it might not be obvious how going from one loop to three creates a factor of 40 slowdown. I'll add to his answer that Scala compiles to JVM bytecodes, and not only does the Scala compiler not fuse the three loops into one, but the Scala compiler is almost certainly allocating all the intermediate arrays. Notoriously, implementations of the JVM are not designed to handle the allocation rates required by functional languages. Allocation is a significant cost in functional programs, and that's one reason the loop-fusion transformations that Don Stewart and his colleagues have implemented for Haskell are so powerful: they eliminate lots of allocations. When you don't have those transformations, plus you're using an expensive allocator such as is found on a typical JVM, that's where the big slowdown comes from.
Scala is a great vehicle for experimenting with the expressive power of an unusual mix of language ideas: classes, mixins, modules, functions, and so on. But it's a relatively young research language, and it runs on the JVM, so it's unreasonable to expect great performance except on the kind of code that JVMs are good at. If you want to experiment with the mix of language ideas that Scala offers, great—it's a really interesting design—but don't expect the same performance on pure functional code that you'd get with a mature compiler for a functional language, like GHC or MLton.
Is Scala functional programming slower than traditional coding?
Not necessarily. Stuff to do with first-class functions, pattern matching, and currying need not be especially slow. But with Scala, more than with other implementations of other functional languages, you really have to watch out for allocations—they can be very expensive.
The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.
That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:
class FastFloatOps(a: Array[Float]) {
  def fastMapOnto(f: Float => Float) = {
    var i = 0
    while (i < a.length) { a(i) = f(a(i)); i += 1 }
    this
  }
  def fastMapWith(b: Array[Float])(f: (Float, Float) => Float) = {
    val len = a.length min b.length
    val c = new Array[Float](len)
    var i = 0
    while (i < len) { c(i) = f(a(i), b(i)); i += 1 }
    c
  }
  def fastReduce(f: (Float, Float) => Float) = {
    if (a.length == 0) Float.NaN
    else {
      var r = a(0)
      var i = 1
      while (i < a.length) { r = f(r, a(i)); i += 1 }
      r
    }
  }
}
implicit def farray2fastfarray(a: Array[Float]) = new FastFloatOps(a)
and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)
Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.
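A hypothetical usage sketch of the FastFloatOps shown above (the values are made up, and the calls go through the implicit conversion defined at the end of the class):

val xs = Array(1f, 2f, 3f)
val ys = Array(4f, 5f, 6f)
val dot = xs.fastMapWith(ys)(_ * _).fastReduce(_ + _)   // 1*4 + 2*5 + 3*6 = 32.0f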
I am not an expert Scala programmer, so there is probably a more efficient method, but what about something like this? It can be tail-call optimized, so performance should be OK.
def multiply_and_sum(l1: List[Int], l2: List[Int], sum: Int): Int = {
  if (l1 != Nil && l2 != Nil) {
    multiply_and_sum(l1.tail, l2.tail, sum + (l1.head * l2.head))
  } else {
    sum
  }
}
val first = Array(1,2,3,4,5)
val second = Array(6,7,8,9,10)
multiply_and_sum(first.toList, second.toList, 0) //Returns: 130
To answer the question in the title: Simple functional constructs may be slower than imperative on the JVM.
But if we consider only simple constructs, then we might as well throw out all modern languages and stick with C or assembler. If you look at the programming language shootout, C always wins.
So why choose a modern language? Because it lets you express a cleaner design, and a cleaner design leads to performance gains in the overall operation of the application, even if some low-level methods are slower. One of my favorite examples is the performance of BuildR vs. Maven. BuildR is written in Ruby, an interpreted, slow language. Maven is written in Java. A build in BuildR is twice as fast as in Maven. This is due mostly to the design of BuildR, which is lightweight compared with that of Maven.
Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforesting and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:
let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) |> Array.reduce ( * )
may be rewritten using fold2 so as to avoid all temporary data structures:
let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys
This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.
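For illustration, a similarly fused version in Scala might look like this (a sketch: it folds over the indices, so no intermediate array of pairs is built, although the generic closure still boxes the accumulator):

def dot(xs: Array[Float], ys: Array[Float]): Float = {
  require(xs.length == ys.length, "arrays must have the same length")
  // Single pass, no zip/map temporaries; the sum is accumulated directly.
  xs.indices.foldLeft(0f)((acc, i) => acc + xs(i) * ys(i))
}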
Here is dbyrnes' solution with arrays (assuming arrays are to be used), just iterating over the index:
def multiplyAndSum(l1: Array[Int], l2: Array[Int]): Int = {
  def productSum(idx: Int, sum: Int): Int =
    if (idx < l1.length)
      productSum(idx + 1, sum + (l1(idx) * l2(idx)))
    else
      sum
  if (l2.length == l1.length)
    productSum(0, 0)
  else
    error("lengths don't fit " + l1.length + " != " + l2.length)
}
val first = (1 to 500).toArray      // Int data, so the element type matches multiplyAndSum
val second = (11 to 510).toArray
def loopi (n: Int) = (1 to n).foreach (dummy => multiplyAndSum (first, second))
println (timed (loopi (100*1000)))
That needs about 1/40 of the time of the list approach. I don't have 2.8 installed, so you will have to test @tailrec yourself. :)

Optimizing the computation of a recursive sequence

What is the fastest way in R to compute a recursive sequence defined as
x[1] <- x1
x[n] <- f(x[n-1])
I am assuming that the vector x of proper length is preallocated. Is there a smarter way than just looping?
Variant: extend this to vectors:
x[,1] <- x1
x[,n] <- f(x[,n-1])
Solve the recurrence relationship ;)
In terms of the question of whether this can be fully "vectorized" in any way, I think the answer is probably "no". The fundamental idea behind array programming is that operations apply to an entire set of values at the same time. Similarly for questions of "embarrassingly parallel" computation. In this case, since your recursive algorithm depends on each prior state, there is no way to gain speed from parallel processing: it must be run serially.
That being said, the usual advice for speeding up your program applies. For instance, do as much of the calculation outside of your recursive function as possible. Sort everything. Predefine your array lengths so they don't have to grow during the looping. Etc. See this question for a similar discussion. There is also a pseudocode example in Tim Hesterberg's article on efficient S-Plus Programming.
You could consider writing it in C / C++ / Fortran and use the handy inline package to deal with the compiling, linking and loading for you.
Of course, your function f() may be a real constraint if that one needs to remain an R function. There is a callback-from-C++-to-R example in Rcpp but this requires a bit more work than just using inline.
Well, if you need the entire sequence, how fast can it be? Assuming that the function is O(1), you cannot do better than O(n), and looping through will give you just that.
In general, the syntax x$y <- f(z) will have to reallocate x every time, which would be very slow if x is a large object. But, it turns out that R has some tricks so that the list replacement function [[<- doesn't reallocate the whole list every time. So I think you can reasonably efficiently do:
x[[1]] <- x1
for (m in seq(2, n))
  x[[m]] <- f(x[[m-1]])
The only wasteful aspect here is that you have to generate an array of length n-1 for the for loop, which isn't ideal, but it's probably not a giant issue. You could replace it by a while loop if you preferred. The usual vectorization tricks (lapply, etc.) won't work here...
(The double brackets give you a list element, which is what you probably want, rather than a singleton list.)
For more details, see Chambers (2008). Software for Data Analysis. p. 473-474.
