Functional languages encourage solving many problems with recursion, so many of them perform Tail Call Optimization (TCO). With TCO, a call that is the last step of a function, whether to another function or to itself (the latter case is also known as Tail Recursion Elimination, a subset of TCO), does not need a new stack frame, which reduces overhead and memory usage.
Ruby has obviously "borrowed" a number of concepts from functional languages (lambdas, functions like map, and so forth), which makes me curious: does Ruby perform tail call optimization?

No, Ruby doesn't perform TCO. However, it also doesn't not perform TCO.
The Ruby Language Specification doesn't say anything about TCO. It doesn't say you have to do it, but it also doesn't say you can't do it. You just can't rely on it.
This is unlike Scheme, where the Language Specification requires that all Implementations must perform TCO. But it is also unlike Python, where Guido van Rossum has made it very clear on multiple occasions (the last time just a couple of days ago) that Python Implementations should not perform TCO.
Yukihiro Matsumoto is sympathetic to TCO; he just doesn't want to force all Implementations to support it. Unfortunately, this means that you cannot rely on TCO, or if you do, your code will no longer be portable to other Ruby Implementations.
So, some Ruby Implementations perform TCO, but most don't. YARV, for example, supports TCO, although (for the moment) you have to explicitly uncomment a line in the source code and recompile the VM, to activate TCO – in future versions it is going to be on by default, after the implementation proves stable. The Parrot Virtual Machine supports TCO natively, therefore Cardinal could quite easily support it, too. The CLR has some support for TCO, which means that IronRuby and Ruby.NET could probably do it. Rubinius could probably do it, too.
But JRuby and XRuby don't support TCO, and they probably won't, unless the JVM itself gains support for TCO. The problem is this: if you want to have a fast implementation, and fast and seamless integration with Java, then you should be stack-compatible with Java and use the JVM's stack as much as possible. You can quite easily implement TCO with trampolines or explicit continuation-passing style, but then you are no longer using the JVM stack, which means that every time you want to call into Java or call from Java into Ruby, you have to perform some kind of conversion, which is slow. So, XRuby and JRuby chose to go with speed and Java integration over TCO and continuations (which basically have the same problem).
This applies to all implementations of Ruby that want to tightly integrate with some host platform that doesn't support TCO natively. For example, I guess MacRuby is going to have the same problem.
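To make the trampoline idea mentioned above concrete, here is a minimal sketch in plain Ruby. The names trampoline and fact_step are hypothetical, chosen only for illustration; this is not how any particular Ruby implementation does it internally. Each step returns either a final value or a lambda describing the next step, so the native call stack stays at a constant depth.

```ruby
# Run a computation expressed as a chain of thunks (lambdas).
# Each step returns either a final value or another lambda to call next,
# so the Ruby call stack does not grow with the recursion depth.
def trampoline(step)
  step = step.call while step.is_a?(Proc)
  step
end

# Tail-recursive factorial rewritten to return a thunk instead of calling itself.
def fact_step(n, acc = 1)
  return acc if n < 2
  -> { fact_step(n - 1, acc * n) }
end

result = trampoline(fact_step(10_000))
puts result.to_s.length   # number of decimal digits in 10000!
```

One limitation of this sketch: a computation whose final result is itself a Proc would be called by mistake, so a real trampoline would wrap thunks in a dedicated type.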

Update: Here's a nice explanation of TCO in Ruby:
Update: You might also want to check out the tco_method gem:
In Ruby MRI (1.9, 2.0 and 2.1) you can turn TCO on with:
RubyVM::InstructionSequence.compile_option = {
  :tailcall_optimization => true,
  :trace_instruction => false
}
There was a proposal to turn TCO on by default in Ruby 2.0. It also explains some issues that come with that: "Tail call optimization: enable by default?"
Short excerpt from the link:
Generally, tail-recursion optimization includes another optimization technique - "call" to "jump" translation. In my opinion, it is difficult to apply this optimization because recognizing "recursion" is difficult in Ruby's world.
Next example. fact() method invocation in "else" clause is not a "tail call".
def fact(n)
  if n < 2
    1
  else
    n * fact(n-1)
  end
end
If you want to use tail-call optimization on fact() method, you need to change fact() method as follows (continuation passing style).
def fact(n, r)
  if n < 2
    r
  else
    fact(n-1, n*r)
  end
end

It can have but is not guaranteed to:

TCO can also be compiled in by tweaking a couple variables in vm_opts.h before compiling:
// vm_opts.h
#define OPT_TRACE_INSTRUCTION 0 // default 1
#define OPT_TAILCALL_OPTIMIZATION 1 // default 0

This builds on Jörg's and Ernest's answers.
Basically it depends on implementation.
I couldn't get Ernest's answer to work on MRI, but it is doable.
I found this example that works for MRI 1.9 to 2.1. It should print a very large number. If you don't set the TCO option to true, you should get the "stack too deep" error.
source = <<-SOURCE
def fact n, acc = 1
  if n.zero?
    acc
  else
    fact n - 1, acc * n
  end
end

fact 10000
SOURCE

i_seq = RubyVM::InstructionSequence.new source, nil, nil, nil,
  tailcall_optimization: true, trace_instruction: false

#puts i_seq.disasm

begin
  value = i_seq.eval
  p value
rescue SystemStackError => e
  p e
end
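Since TCO cannot be relied on across implementations, a portable alternative is to turn the tail call into an explicit loop by hand. The following is a minimal sketch of that rewrite; fact_recursive and fact_loop are illustrative names, not part of any library.

```ruby
# Tail-recursive version: only safe for large n if the implementation
# actually eliminates the tail call.
def fact_recursive(n, acc = 1)
  return acc if n.zero?
  fact_recursive(n - 1, acc * n)
end

# The same computation as an explicit loop: the accumulator and the counter
# become local variables, and the tail call becomes the next iteration.
def fact_loop(n)
  acc = 1
  until n.zero?
    acc *= n
    n -= 1
  end
  acc
end

puts fact_loop(10_000) == (1..10_000).inject(:*)
puts fact_recursive(20) == fact_loop(20)
```

The loop version uses constant stack depth on every Ruby implementation, at the cost of giving up the recursive formulation.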


Ruby performance in .each loops

Consider the following two pieces of Ruby code
Example 1
name = user.first_name
round_number = rounds.count
users.each do |u|
  puts "#{name} beat #{u.first_name} in round #{round_number}"
end
Example 2
users.each do |u|
  puts "#{user.first_name} beat #{u.first_name} in #{rounds.count}"
end
For both pieces of code imagine
def first_name
So in a classical analysis of algorithms, the first piece of code would be more efficient. However, in most modern compiled languages, the compiler would optimize the second piece of code to look like the first, eliminating the need to optimize code in this manner.
Will ruby optimize or cache values for this code before execution? Should my ruby code look like example 1 or example 2?
Example 1 will run faster, as first_name() is only called once, and its value is stored in the variable.
In Example 2 Ruby will not memoize this value automatically, since the value could have changed between iterations of the each() loop.
Therefore expensive-to-calculate methods should be explicitly memoized if they are expected to be used more than once without the return value changing.
Making use of Ruby's Benchmark Module can be useful when making decisions like this. It will likely only be worth memoizing if there are a lot of values in users, or if first_name() is expensive to calculate.
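As a rough illustration of measuring the two styles with the Benchmark module, here is a minimal sketch. SlowUser is a hypothetical stand-in whose first_name sleeps briefly to simulate an expensive lookup; the opponents array and the 1 ms delay are assumptions made only for the demonstration.

```ruby
require "benchmark"

# Hypothetical stand-in for a user whose first_name is expensive to compute.
class SlowUser
  attr_reader :calls

  def initialize(name)
    @name = name
    @calls = 0
  end

  def first_name
    @calls += 1
    sleep 0.001   # simulate a slow lookup
    @name
  end
end

opponents = Array.new(50) { |i| "opponent#{i}" }
hoisted = SlowUser.new("alice")
inlined = SlowUser.new("alice")

Benchmark.bm(8) do |bm|
  bm.report("hoisted") do
    name = hoisted.first_name   # called once, as in Example 1
    opponents.each { |o| "#{name} beat #{o}" }
  end
  bm.report("inlined") do
    # called on every iteration, as in Example 2
    opponents.each { |o| "#{inlined.first_name} beat #{o}" }
  end
end

puts "hoisted calls: #{hoisted.calls}, inlined calls: #{inlined.calls}"
```

The call counters make the difference visible independently of timing noise. Caching inside the method with an instance variable (the @value ||= ... idiom) is another common way to memoize, with the caveat that it recomputes when the cached value is nil or false.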
A compiler can only perform this optimization if it can prove that the method has no side effects. This is even more difficult in Ruby than most languages, as everything is mutable and can be overridden at runtime. Whether it happens or not is implementation dependent, but since it's hard to do in Ruby, most do not. I actually don't know of any that do at the time of this posting.
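A small sketch of why hoisting the call is unsafe in general: any code inside the loop may redefine the method, so two calls in consecutive iterations can legitimately return different values. The Player class below is hypothetical and exists only to show the effect.

```ruby
class Player
  def first_name
    "Ann"
  end
end

player = Player.new
seen = []

3.times do |i|
  seen << player.first_name
  # Any code in the loop body may reopen the class and change the method.
  if i == 0
    Player.class_eval do
      def first_name
        "Bob"
      end
    end
  end
end

p seen   # => ["Ann", "Bob", "Bob"]
```

An optimizer that cached the first result would print "Ann" three times, which would change the program's behaviour.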

Why does Ruby's Fixnum#/ method round down when it is called on another Fixnum?

Okay, so what's up with this?
irb(main):001:0> 4/3
=> 1
irb(main):002:0> 7/8
=> 0
irb(main):003:0> 5/2
=> 2
I realize Ruby is doing integer division here, but why? With a language as flexible as Ruby, why couldn't 5/2 return the actual, mathematical result of 5/2? Is there some common use for integer division that I'm missing? It seems to me that making 7/8 return 0 would cause more confusion than any good that might come from it is worth. Is there any real reason why Ruby does this?
Because most languages (even advanced, high-level ones) do it? You get the same behaviour on integers in C, C++, Java, and Python 2 (Perl and Python 3 return a fractional result from / by default). This is integer division with a remainder, hence the corresponding modulo operator %. For positive operands it matches Euclidean division; for negative operands Ruby rounds toward negative infinity, whereas Java and C99 truncate toward zero.
Integer division is even implemented at the hardware level on many architectures. Others have asked this question, and one reason is symmetry: in statically typed languages such as C, this allows all integer operations to return integers, without loss of precision. It also allows easy access to the corresponding low-level assembler operation, since C was designed as a sort of extension layer over it.
Moreover, as explained in one comment to the linked article, floating point operations were costly (or not supported on all architectures) for many years, and not required for processes such as splitting a dataset in fixed lots.
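For readers who do want the fractional result, the following sketch shows the usual ways to ask for it explicitly, next to the integer behaviour discussed above. The variable names are illustrative only.

```ruby
int_quotient = 7 / 8            # Integer / Integer stays an Integer
float_result = 7.fdiv(8)        # explicit floating-point division
mixed_result = 7 / 8.0          # one Float operand makes the result a Float
exact_result = Rational(7, 8)   # exact fraction, no rounding

p int_quotient   # => 0
p float_result   # => 0.875
p mixed_result   # => 0.875
p exact_result   # => (7/8)

# Quotient and remainder together, e.g. when splitting data into fixed lots.
lots, leftover = 17.divmod(5)
p [lots, leftover]   # => [3, 2]

# Ruby floors toward negative infinity for negative operands.
p(-7 / 2)               # => -4
p(-7 % 2)               # => 1
p((-7).remainder(2))    # => -1
```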

What are some obvious optimizations for a virtual machine implementing a functional language?

I'm working on an intermediate language and a virtual machine to run a functional language with a couple of "problematic" properties:
Lexical namespaces (closures)
Dynamically growing call stack
A slow integer type (bignums)
The intermediate language is stack based, with a simple hash-table for the current namespace. Just so you get an idea of what it looks like, here's the McCarthy91 function:
# McCarthy 91: M(n) = n - 10 if n > 100 else M(M(n + 11))
.sub M
sto n
rcl n
float 100
rcl n
float 10
rcl n
float 11
list 1
rcl M
list 1
rcl M
The "big loop" is straightforward:
fetch an instruction
increment the instruction pointer (or program counter)
evaluate the instruction
Along with sto, rcl and a whole lot more, there are three instructions for function calls:
call copies the namespace (deep copy) and pushes the instruction pointer onto the call stack
call-fast is the same, but only creates a shallow copy
tail is basically a 'goto'
The implementation is really straightforward. To give you a better idea, here's just a random snippet from the middle of the "big loop" (updated, see below)
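} else if inst == 2 /* STO */ {
    local[data] = stack[len(stack) - 1]
    if code[ip + 1][0] != 3 {
        // Next instruction is not RCL: pop the stored value.
        stack = stack[:len(stack) - 1]
    } else {
        // Next instruction is RCL: leave the value on the stack and skip it.
        ip++
    }
} else if inst == 3 /* RCL */ {
    stack = append(stack, local[data])
} else if inst == 12 /* .END */ {
    // Return: drop the callee's scope and restore the saved instruction pointer.
    outer = outer[:len(outer) - 1]
    ip = calls[len(calls) - 1]
    calls = calls[:len(calls) - 1]
} else if inst == 20 /* CALL */ {
    // Save the return address and copy the local scope for the callee.
    calls = append(calls, ip)
    cp := make(Local, len(local))
    copy(cp, local)
    outer = append(outer, &cp)
    x := stack[len(stack) - 1]
    stack = stack[:len(stack) - 1]
    ip = x.(int)
} else if inst == 21 /* TAIL */ {
    // Tail call: jump to the target without saving a return address.
    x := stack[len(stack) - 1]
    stack = stack[:len(stack) - 1]
    ip = x.(int)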
The problem is this: Calling McCarthy91 16 times with a value of -10000 takes, near as makes no difference, 3 seconds (after optimizing away the deep-copy, which adds nearly a second).
My question is: What are some common techniques for optimizing interpretation of this kind of language? Is there any low-hanging fruit?
I used slices for my lists (arguments, the various stacks, slice of maps for the namespaces, ...), so I do this sort of thing all over the place: call_stack[:len(call_stack) - 1]. Right now, I really don't have a clue what pieces of code make this program slow. Any tips will be appreciated, though I'm primarily looking for general optimization strategies.
I can reduce execution time quite a bit by circumventing my calling conventions. The list <n> instruction fetches n arguments off the stack and pushes a list of them back onto the stack; the args instruction pops off that list and pushes each item back onto the stack. This is firstly to check that functions are called with the correct number of arguments and secondly to be able to call functions with variable argument-lists (i.e. (defun f x:xs)). Removing that, and also adding an instruction sto* <x>, which replaces sto <x>; rcl <x>, I can get it down to 2 seconds. Still not brilliant, and I have to have this list/args business anyway. :)
Another aside (this is a long question I know, sorry):
Profiling the program with pprof told me very little (I'm new to Go in case that's not obvious) :-). These are the top 3 items as reported by pprof:
16 6.1% 6.1% 16 6.1% sweep pkg/runtime/mgc0.c:745
9 3.4% 9.5% 9 3.4% fmt.(*fmt).fmt_qc pkg/fmt/format.go:323
4 1.5% 13.0% 4 1.5% fmt.(*fmt).integer pkg/fmt/format.go:248
These are the changes I've made so far:
I removed the hash table. Instead I'm now passing just pointers to arrays, and I only efficiently copy the local scope when I have to.
I replaced the instruction names with integer opcodes. Before, I wasted quite a bit of time comparing strings.
The call-fast instruction is gone (the speedup wasn't measurable anymore after the other changes)
Instead of having "int", "float" and "str" instructions, I just have one eval and I evaluate the constants at compile time (compilation of the bytecode that is). Then eval just pushes a reference to them.
After changing the semantics of .if, I could get rid of these pseudo-functions. It's now .if, .else and .endif, with implicit gotos and block-semantics similar to .sub. (some example code)
After implementing the lexer, parser, and bytecode compiler, the speed went down a little bit, but not terribly so. Calculating MC(-10000) 16 times makes it evaluate 4.2 million bytecode instructions in 1.2 seconds. Here's a sample of the code it generates (from this).
The whole thing is on github
There are decades of research on things you can optimize:
Implementing functional languages: a tutorial, Simon Peyton Jones and David Lester. Published by Prentice Hall, 1992.
Practical Foundations for Programming Languages, Robert Harper
You should have efficient algorithmic representations for the various concepts of your interpreter. Doing deep copies on a hashtable looks like a terrible idea, but I see that you have already removed that.
(Yes, your stack-popping operation using array slices looks suspect. You should make sure it really has the expected algorithmic complexity, or else use a dedicated data structure (... a stack). I'm generally wary of using all-purpose data structures such as Python lists or PHP hashtables for this, because they are not necessarily designed to handle this particular use case well, but it may be that slices do guarantee O(1) pushing and popping under all circumstances.)
The best way to handle environments, as long as they don't need to be reified, is to use numeric indices instead of variables: de Bruijn indices (0 for the variable bound last) or de Bruijn levels (0 for the variable bound first). This way you only need to keep a dynamically resized array for the environment, and accessing it is very fast. If you have first-class closures you will also need to capture the environment, which will be more costly: you have to copy part of it into a dedicated structure, or use a non-mutable structure for the whole environment. Only experiment will tell, but my experience is that going for a fast mutable environment structure and paying a higher cost for closure construction is better than having an immutable structure with more bookkeeping all the time; of course you should do a usage analysis to capture only the necessary variables in your closures.
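The numeric-index idea can be sketched as follows, in Ruby rather than the asker's Go. This is a simplified illustration under stated assumptions: a single flat scope, a hypothetical SlotResolver helper, and a made-up two-instruction program. It resolves names to slot numbers once at compile time, so the running code only indexes an array.

```ruby
# Compile-time helper: assigns each variable name a fixed slot number
# (a de Bruijn level: 0 for the variable bound first).
class SlotResolver
  def initialize
    @slots = {}
  end

  # Returns the slot for name, allocating the next free one on first use.
  def slot_for(name)
    @slots[name] ||= @slots.size
  end

  def size
    @slots.size
  end
end

# Hypothetical symbolic program: [instruction, variable name].
source = [[:sto, "n"], [:sto, "m"], [:rcl, "n"], [:rcl, "m"]]

resolver = SlotResolver.new
compiled = source.map { |op, name| [op, resolver.slot_for(name)] }
p compiled   # => [[:sto, 0], [:sto, 1], [:rcl, 0], [:rcl, 1]]

# Run time: the environment is a plain array indexed by slot, no hashing.
env = Array.new(resolver.size)
stack = [10, 20]
compiled.each do |op, slot|
  case op
  when :sto then env[slot] = stack.pop
  when :rcl then stack.push(env[slot])
  end
end
p env     # => [20, 10]
p stack   # => [20, 10]
```

Nested scopes and closure capture are deliberately left out; they are where the trade-offs described above come into play.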
Finally, once you have rooted out the inefficiency sources related to your algorithmic choices, the hot area will be:
garbage collection (definitely a hard topic; if you don't want to become a GC expert, you should seriously consider reusing an existing runtime); you may be using the GC of your host language (heap-allocations in your interpreted language are translated into heap-allocations in your implementation language, with the same lifetime), it's not clear in the code snippet you've shown; this strategy is excellent to get something reasonably efficient in a simple way
numerics implementation; there are all kind of hacks to be efficient when the integers you manipulate are in fact small. Your best bet is to reuse the work of people that have invested tons of effort on this, so I strongly recommend you reuse for example the GMP library. Then again, you may also reuse your host language support for bignum if it has some, in your case Go's math/big package.
the low-level design of your interpreter loop. In a language with "simple bytecode" such as yours (each bytecode instruction is translated in a small number of native instructions, as opposed to complex bytecodes having high-level semantics such as the Parrot bytecode), the actual "looping and dispatching on bytecodes" code can be a bottleneck. There has been quite a lot of research on what's the best way to write such bytecode dispatch loops, to avoid the cascade of if/then/else (jump tables), benefit from the host processor branch prediction, simplify the control flow, etc. This is called threaded code and there are a lot of (rather simple) different techniques : direct threading, indirect threading... If you want to look into some of the research, there is for example work by Anton Ertl, The Structure and Performance of Efficient Interpreters in 2003, and later Context threading: A flexible and efficient dispatch technique for virtual machine interpreters. The benefits of those techniques tend to be fairly processor-sensitive, so you should test the various possibilities yourself.
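To illustrate the simplest step away from an if/else cascade, here is a minimal sketch of table-based dispatch, written in Ruby rather than the asker's Go. It is not threaded code in the strict sense, only a handler table indexed by opcode; the opcodes, the VM struct, and the run method are all hypothetical.

```ruby
# Hypothetical opcodes for a tiny stack machine.
PUSH, ADD, MUL, JMPZ, HALT = 0, 1, 2, 3, 4

VM = Struct.new(:stack, :ip, :running)

# One handler per opcode, stored at the opcode's index.
HANDLERS = []
HANDLERS[PUSH] = ->(vm, arg) { vm.stack.push(arg) }
HANDLERS[ADD]  = ->(vm, _) { b = vm.stack.pop; a = vm.stack.pop; vm.stack.push(a + b) }
HANDLERS[MUL]  = ->(vm, _) { b = vm.stack.pop; a = vm.stack.pop; vm.stack.push(a * b) }
HANDLERS[JMPZ] = ->(vm, target) { vm.ip = target if vm.stack.pop.zero? }
HANDLERS[HALT] = ->(vm, _) { vm.running = false }

def run(code)
  vm = VM.new([], 0, true)
  while vm.running
    op, arg = code[vm.ip]        # fetch an instruction
    vm.ip += 1                   # increment the instruction pointer
    HANDLERS[op].call(vm, arg)   # evaluate via table lookup, not if/else
  end
  vm.stack
end

program = [[PUSH, 2], [PUSH, 3], [ADD, nil], [PUSH, 4], [MUL, nil], [HALT, nil]]
p run(program)   # => [20]
```

Whether a table like this beats a chain of comparisons depends heavily on the host language and processor, so it is worth measuring rather than assuming.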
While the STG work is interesting (and Peyton Jones's book on programming language implementation is excellent), it is somewhat oriented towards lazy evaluation. Regarding the design of efficient bytecode for strict functional languages, my reference is Xavier Leroy's 1990 work on the ZINC machine: The ZINC experiment: An Economical Implementation of the ML Language, which was ground-breaking work for the implementation of ML languages, and is still in use in the implementation of the OCaml language: there are both a bytecode and a native compiler, but the bytecode still uses a glorified ZINC machine.

Scala performance question

In the article written by Daniel Korzekwa, he said that the performance of the following code: list.map(e => e*2).filter(e => e>10)
is much worse than the iterative solution written using Java.
Can anyone explain why? And what is the best solution for such code in Scala (I hope it's not a Java iterative version which is Scala-fied)?
The reason that particular code is slow is because it's working on primitives but it's using generic operations, so the primitives have to be boxed. (This could be improved if List and its ancestors were specialized.) This will probably slow things down by a factor of 5 or so.
Also, algorithmically, those operations are somewhat expensive, because you make a whole list, and then make a whole new list throwing a few components of the intermediate list away. If you did it in one swoop, then you'd be better off. You could do something like:
list collect { case e if (e*2>10) => e*2 }
but what if the calculation e*2 is really expensive? Then you could
(List[Int]() /: list)((ls,e) => { val x = e*2; if (x>10) x :: ls else ls })
except that this would appear backwards. (You could reverse it if need be, but that requires creating a new list, which again isn't ideal algorithmically.)
Of course, you have the same sort of algorithmic problems in Java if you're using a singly linked list--your new list will end up backwards, or you have to create it twice, first in reverse and then forwards, or you have to build it with (non-tail) recursion (which is easy in Scala, but inadvisable for this sort of thing in either language since you'll exhaust the stack), or you have to create a mutable list and then pretend afterwards that it's not mutable. (Which, incidentally, you can do in Scala also--see mutable.LinkedList.)
Basically it's traversing a list twice. Once for multiplying every element with two. And then another time to filter the results.
Think of it as one while loop producing a LinkedList with the intermediate results. And then another loop applying the filter to produce the final results.
This should be faster: list.view.map(e => e * 2).filter(e => e > 10).force
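The difference between two eager passes and a single lazy pipeline can also be seen in Ruby, whose lazy enumerators play a role similar to a Scala view. The sketch below is only an analogue under that assumption, not a statement about Scala's implementation; the variable names are illustrative.

```ruby
list = (1..10).to_a

# Two passes: map builds a full intermediate array, then select builds another.
eager = list.map { |e| e * 2 }.select { |e| e > 10 }

# Lazy pipeline: elements flow through map and select one at a time,
# and only the final array is materialised by force.
lazy = list.lazy.map { |e| e * 2 }.select { |e| e > 10 }.force

# Single explicit pass that computes e * 2 once per element.
single = list.each_with_object([]) do |e, out|
  x = e * 2
  out << x if x > 10
end

p eager                                # => [12, 14, 16, 18, 20]
p lazy == eager && single == eager     # => true
```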
The solution lies mostly with the JVM. Though Scala has a workaround in the form of @specialized, that increases the size of any specialized class hugely, and only solves half the problem -- the other half being the creation of temporary objects.
The JVM actually does a good job optimizing a lot of it, or the performance would be even more terrible, but Java does not require the optimizations that Scala does, so JVM does not provide them. I expect that to change to some extent with the introduction of SAM not-real-closures in Java.
But, in the end, it comes down to balancing the needs. The same while loop that Java and Scala do so much faster than Scala's function equivalent can be done faster yet in C. Yet, despite what the microbenchmarks tell us, people use Java.
Scala approach is much more abstract and generic. Therefore it is hard to optimize every single case.
I could imagine that the HotSpot JIT compiler might apply stream- and loop-fusion to the code in the future if it sees that the intermediate results are not used.
Additionally the Java code just does much more.
If you really just want to mutate over a datastructure, consider transform.
It looks a bit like map but doesn't create a new collection, e. g.:
val array = Array(1,2,3,4,5,6,7,8,9,10).transform(_ * 2)
// array is now WrappedArray(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
I really hope some additional in-place operations will be added in the future...
To avoid traversing the list twice, I think the for syntax is a nice option here:
val list2 = for(v <- list1; e = v * 2; if e > 10) yield e
Rex Kerr correctly states the major problem: Operating on immutable lists, the stated piece of code creates intermediate lists in memory. Note that this is not necessarily slower than equivalent Java code; you just never use immutable datastructures in Java.
Wilfried Springer has a nice, idiomatic Scala solution. Using view, no (manipulated) copies of the whole list are created.
Note that using view might not always be ideal. For example, if your first call is a filter that is expected to throw away most of the list, it might be worthwhile to create the shorter version explicitly and use view only after that, in order to improve memory locality for later iterations.
list.filter(e => e*2>10).map(e => e*2)
This approach shrinks the List first, so the second traversal runs over fewer elements.

Does Scala support tail recursion optimization?

Does Scala support tail recursion optimization?
Scala does tail recursion optimisation at compile-time, as other posters have said. That is, a tail recursive function is transformed into a loop by the compiler (a method invoke is transformed into a jump), as can be seen from the stack trace when running a tail recursive function.
Try the following snippet:
def boom(n: Int): Nothing = if(n<=0) throw new Exception else boom(n-1)
and inspect the stack trace. It will show only one call to the function boom - therefore the compiled bytecode is not recursive.
There is a proposal floating around to implement tail calls at the JVM level - which in my opinion would be a great thing to do, as then the JVM could do runtime optimizations, rather than just compile-time optimizations of the code - and could possibly mean more flexible tail recursion. Basically a tailcall invoke would behave exactly like a normal method invoke but would drop the stack frame of the caller when it's safe to do so - the specification of the JVM states that stack frames must be preserved, so the JIT has to do some static code analysis to find out if the stack frames are never going to be used.
The current status of it is proto 80%. I don't think it will be done in time for Java 7 (invokedynamic has a greater priority, and the implementation is almost done) but Java 8 might see it implemented.
In Scala 2.8 you can use @tailrec to mark a specific method that you expect the compiler to optimise:
import scala.annotation.tailrec
@tailrec def factorialAcc(acc: Int, n: Int): Int = {
  if (n <= 1) acc
  else factorialAcc(n * acc, n - 1)
}
If a method cannot be optimized, you get a compile-time error.
Scala 2.7.x supports tail-call optimization for self-recursion (a function calling itself) of final methods and local functions.
Scala 2.8 might come with library support for trampoline too, which is a technique to optimize mutually recursive functions.
A good deal of information about the state of Scala recursion can be found in Rich Dougherty's blog.
Only in very simple cases where the function is self-recursive.
Proof of tail recursion ability.
It looks like Scala 2.8 might be improving tail-recursion recognition, though.
