What's the (hidden) cost of Scala's lazy val? - performance

One handy feature of Scala is lazy val, where the evaluation of a val is delayed until it's necessary (at first access).
Of course, a lazy val must have some overhead: somewhere Scala must keep track of whether the value has already been evaluated, and the evaluation must be synchronized, because multiple threads might try to access the value for the first time simultaneously.
What exactly is the cost of a lazy val? Is there a hidden boolean flag associated with a lazy val to keep track of whether it has been evaluated, what exactly is synchronized, and are there any more costs?
In addition, suppose I do this:
class Something {
  lazy val (x, y) = { ... }
}
Is this the same as having two separate lazy vals x and y or do I get the overhead only once, for the pair (x, y)?

This is taken from the Scala mailing list and gives implementation details of lazy vals in terms of Java code (rather than bytecode):
class LazyTest {
  lazy val msg = "Lazy"
}
is compiled to something equivalent to the following Java code:
class LazyTest {
  public int bitmap$0;
  private String msg;

  public String msg() {
    if ((bitmap$0 & 1) == 0) {
      synchronized (this) {
        if ((bitmap$0 & 1) == 0) {
          msg = "Lazy";
          bitmap$0 = bitmap$0 | 1;
        }
      }
    }
    return msg;
  }
}

It looks like the compiler arranges for a class-level bitmap int field to flag multiple lazy fields as initialized (or not) and initializes the target field in a synchronized block if the relevant bit of the bitmap (tested with a bitwise and) indicates it is necessary.
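As a quick illustration of that sharing, here is a hedged sketch in source terms (bit assignments and the bitmap$0 name are compiler-internal and vary across Scala versions):

class Several {
  lazy val a = compute("a") // guarded by bit 0 of the shared bitmap field
  lazy val b = compute("b") // guarded by bit 1
  lazy val c = compute("c") // guarded by bit 2; the field widens as needed

  private def compute(name: String): String = {
    println(s"computing $name") // printed once per value, at first access
    name
  }
}

Accessing a initializes only a; b and c stay untouched until their own first access, each flipping its own bit.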
Using:
class Something {
  lazy val foo = getFoo
  def getFoo = "foo!"
}
produces sample bytecode:
0 aload_0 [this]
1 getfield blevins.example.Something.bitmap$0 : int [15]
4 iconst_1
5 iand
6 iconst_0
7 if_icmpne 48
10 aload_0 [this]
11 dup
12 astore_1
13 monitorenter
14 aload_0 [this]
15 getfield blevins.example.Something.bitmap$0 : int [15]
18 iconst_1
19 iand
20 iconst_0
21 if_icmpne 42
24 aload_0 [this]
25 aload_0 [this]
26 invokevirtual blevins.example.Something.getFoo() : java.lang.String [18]
29 putfield blevins.example.Something.foo : java.lang.String [20]
32 aload_0 [this]
33 aload_0 [this]
34 getfield blevins.example.Something.bitmap$0 : int [15]
37 iconst_1
38 ior
39 putfield blevins.example.Something.bitmap$0 : int [15]
42 getstatic scala.runtime.BoxedUnit.UNIT : scala.runtime.BoxedUnit [26]
45 pop
46 aload_1
47 monitorexit
48 aload_0 [this]
49 getfield blevins.example.Something.foo : java.lang.String [20]
52 areturn
53 aload_1
54 monitorexit
55 athrow
Values initialized in tuples like lazy val (x,y) = { ... } have nested caching via the same mechanism. The tuple result is lazily evaluated and cached, and an access of either x or y will trigger the tuple evaluation. Extraction of the individual values from the tuple is done independently and lazily (and cached). So the above tuple-pattern code generates an x, a y, and an x$1 field of type Tuple2.
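In source terms, that desugaring looks roughly like the following hedged sketch (pair stands in for the compiler's synthetic x$1 field, and the initializer is a placeholder):

class Something {
  // one lazy slot caches the whole tuple...
  private lazy val pair: (Int, Int) = {
    println("tuple computed once")
    (1, 2) // stand-in for the real { ... } block
  }
  // ...and each component is extracted lazily and cached on its own
  lazy val x: Int = pair._1
  lazy val y: Int = pair._2
}

Touching either x or y forces the tuple once; each component's extraction is then cached separately, so you pay the lazy-val bookkeeping three times in total.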

With Scala 2.10, a lazy value like:
class Example {
  lazy val x = "Value"
}
is compiled to byte code that resembles the following Java code:
public class Example {
  private String x;
  private volatile boolean bitmap$0;

  public String x() {
    if (this.bitmap$0 == true) {
      return this.x;
    } else {
      return x$lzycompute();
    }
  }

  private String x$lzycompute() {
    synchronized (this) {
      if (this.bitmap$0 != true) {
        this.x = "Value";
        this.bitmap$0 = true;
      }
      return this.x;
    }
  }
}
Note that with a single lazy val the bitmap is represented by a boolean. If you add another lazy val, the compiler widens the field so that it can represent at least two values, i.e. to a byte, and keeps widening it for classes with many lazy vals.
But you might wonder why this works, given that x itself is not volatile. Because bitmap$0 is volatile, the write to x happens-before the volatile write to bitmap$0, so any thread that reads bitmap$0 as true is guaranteed by the Java memory model to also see the fully initialized x. This blog article gives an explanation.
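To make the publication argument concrete, the same idiom written by hand in Scala looks like this (a hedged sketch of the pattern, not the compiler's literal output):

class ManualLazy {
  @volatile private var bitmap = false // plays the role of bitmap$0
  private var x: String = _            // deliberately not volatile

  def value: String = {
    if (!bitmap) synchronized {
      if (!bitmap) {
        x = "Value"   // plain write...
        bitmap = true // ...published by this volatile write
      }
    }
    // Any thread that read bitmap == true above is guaranteed, via the
    // volatile happens-before edge, to see the fully initialized x.
    x
  }
}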

Scala SIP-20 proposes a new implementation of lazy val, which is more correct but ~25% slower than the "current" version.
The proposed implementation looks like:
class LazyCellBase { // in a Java file - we need a public bitmap_0
  public static AtomicIntegerFieldUpdater<LazyCellBase> arfu_0 =
    AtomicIntegerFieldUpdater.newUpdater(LazyCellBase.class, "bitmap_0");
  public volatile int bitmap_0 = 0;
}

final class LazyCell extends LazyCellBase {
  import LazyCellBase._
  var value_0: Int = _

  @tailrec final def value(): Int = (arfu_0.get(this): @switch) match {
    case 0 =>
      if (arfu_0.compareAndSet(this, 0, 1)) {
        val result = 0 // stands in for the lazy val's initializer expression
        value_0 = result

        @tailrec def complete(): Unit = (arfu_0.get(this): @switch) match {
          case 1 =>
            if (!arfu_0.compareAndSet(this, 1, 3)) complete()
          case 2 =>
            if (arfu_0.compareAndSet(this, 2, 3)) {
              synchronized { notifyAll() }
            } else complete()
        }

        complete()
        result
      } else value()
    case 1 =>
      arfu_0.compareAndSet(this, 1, 2)
      synchronized {
        while (arfu_0.get(this) != 3) wait()
      }
      value_0
    case 2 =>
      synchronized {
        while (arfu_0.get(this) != 3) wait()
      }
      value_0
    case 3 => value_0
  }
}
As of June 2013 this SIP hasn't been approved. I expect that it's likely to be approved and included in a future version of Scala based on the mailing list discussion. Consequently, I think you'd be wise to heed Daniel Spiewak's observation:
Lazy val is *not* free (or even cheap). Use it only if you absolutely
need laziness for correctness, not for optimization.

I've written a post about this issue: https://dzone.com/articles/cost-laziness
In a nutshell, the penalty is so small that in practice you can ignore it.
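If you want to sanity-check that on your own hardware, a rough timing harness might look like the sketch below (not JMH, so treat the numbers as a ballpark only):

object LazyCost {
  class Eager { val x = 42 }
  class Lazy  { lazy val x = 42 }

  def time(label: String)(body: => Int): Unit = {
    val start = System.nanoTime()
    val r = body
    println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms (result $r)")
  }

  def main(args: Array[String]): Unit = {
    val e = new Eager
    val l = new Lazy
    val n = 100000000
    // Accumulate into a result so the JIT cannot drop the loops entirely.
    time("val access")      { var s = 0; var i = 0; while (i < n) { s += e.x; i += 1 }; s }
    time("lazy val access") { var s = 0; var i = 0; while (i < n) { s += l.x; i += 1 }; s }
  }
}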

Given the bytecode Scala generates for lazy vals, it could in principle suffer from the thread-safety problem described in the classic double-checked-locking article: http://www.javaworld.com/javaworld/jw-05-2001/jw-0525-double.html?page=1
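For contrast with the volatile sketch earlier, the broken shape that article describes is the same pattern with a non-volatile flag (a hedged illustration; Scala 2.10's generated code avoids it by making the bitmap field volatile, as shown above):

class BrokenLazy {
  private var bitmap = false // NOT volatile: this is the unsafe part
  private var x: String = _

  def value: String = {
    if (!bitmap) synchronized {
      if (!bitmap) {
        x = "Value"
        bitmap = true
      }
    }
    // A reader on the unsynchronized fast path may observe bitmap == true
    // without a happens-before edge to the write of x, and so read a
    // stale (null) x.
    x
  }
}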

Related

sum and max values in a single iteration

I have a List of custom CallRecord objects:
public class CallRecord {
  private String callId;
  private String aNum;
  private String bNum;
  private int seqNum;
  private byte causeForOutput;
  private int duration;
  private RecordType recordType;
  // ...
}
There are two logical conditions and the output of each is:
Highest seqNum, sum(duration)
Highest seqNum, sum(duration), highest causeForOutput
As per my understanding, Stream.max(), Collectors.summarizingInt() and so on will each require a separate iteration for the above result. I also came across a thread suggesting a custom collector, but I am unsure.
Below is the simple, pre-Java 8 code that is serving the purpose:
if (...) {
  for (CallRecord currentRecord : completeCallRecords) {
    highestSeqNum = currentRecord.getSeqNum() > highestSeqNum ? currentRecord.getSeqNum() : highestSeqNum;
    sumOfDuration += currentRecord.getDuration();
  }
} else {
  byte highestCauseForOutput = 0;
  for (CallRecord currentRecord : completeCallRecords) {
    highestSeqNum = currentRecord.getSeqNum() > highestSeqNum ? currentRecord.getSeqNum() : highestSeqNum;
    sumOfDuration += currentRecord.getDuration();
    highestCauseForOutput = currentRecord.getCauseForOutput() > highestCauseForOutput ? currentRecord.getCauseForOutput() : highestCauseForOutput;
  }
}
Your desire to do everything in a single iteration is irrational. You should strive for simplicity first, performance if necessary, but insisting on a single iteration is neither.
The performance depends on too many factors to make a prediction in advance. The process of iterating (over a plain collection) itself is not necessarily an expensive operation, and it may even benefit from a simpler loop body in a way that makes multiple traversals with a straightforward operation more efficient than a single traversal trying to do everything at once. The only way to find out is to measure using the actual operations.
Converting the operation to Stream operations may simplify the code, if you use it straightforwardly, i.e.
int highestSeqNum =
  completeCallRecords.stream().mapToInt(CallRecord::getSeqNum).max().orElse(-1);
int sumOfDuration =
  completeCallRecords.stream().mapToInt(CallRecord::getDuration).sum();
if (!condition) {
  byte highestCauseForOutput = (byte)
    completeCallRecords.stream().mapToInt(CallRecord::getCauseForOutput).max().orElse(0);
}
If you still feel uncomfortable with the fact that there are multiple iterations, you could try to write a custom collector performing all operations at once, but the result will not be better than your loop, neither in terms of readability nor efficiency.
Still, I’d prefer avoiding code duplication over trying to do everything in one loop, i.e.
for (CallRecord currentRecord : completeCallRecords) {
  int nextSeqNum = currentRecord.getSeqNum();
  highestSeqNum = nextSeqNum > highestSeqNum ? nextSeqNum : highestSeqNum;
  sumOfDuration += currentRecord.getDuration();
}
if (!condition) {
  byte highestCauseForOutput = 0;
  for (CallRecord currentRecord : completeCallRecords) {
    byte next = currentRecord.getCauseForOutput();
    highestCauseForOutput = next > highestCauseForOutput ? next : highestCauseForOutput;
  }
}
With Java 8 you can solve it with a Collector, with no redundant iteration.
Normally, we can use the factory methods from Collectors, but in your case you need to implement a custom Collector that reduces a Stream<CallRecord> to an instance of SummarizingCallRecord, which contains the attributes you require.
Mutable accumulation/result type:
class SummarizingCallRecord {
  private int highestSeqNum = 0;
  private int sumDuration = 0;
  // getters/setters ...
}
Custom collector:
BiConsumer<SummarizingCallRecord, CallRecord> myAccumulator = (a, callRecord) -> {
  a.setHighestSeqNum(Math.max(a.getHighestSeqNum(), callRecord.getSeqNum()));
  a.setSumDuration(a.getSumDuration() + callRecord.getDuration());
};

BinaryOperator<SummarizingCallRecord> myCombiner = (a1, a2) -> {
  a1.setHighestSeqNum(Math.max(a1.getHighestSeqNum(), a2.getHighestSeqNum()));
  a1.setSumDuration(a1.getSumDuration() + a2.getSumDuration());
  return a1;
};

Collector<CallRecord, SummarizingCallRecord, SummarizingCallRecord> myCollector =
  Collector.of(
    () -> new SummarizingCallRecord(),
    myAccumulator,
    myCombiner
    // optional: Collector.Characteristics.CONCURRENT / IDENTITY_FINISH / UNORDERED
  );
Execution example:
List<CallRecord> callRecords = new ArrayList<>();
callRecords.add(new CallRecord(1, 100));
callRecords.add(new CallRecord(5, 50));
callRecords.add(new CallRecord(3, 1000));
SummarizingCallRecord summarizingCallRecord = callRecords.stream()
.collect(myCollector);
// Result:
// summarizingCallRecord.highestSeqNum = 5
// summarizingCallRecord.sumDuration = 1150
You don't need to, and should not, implement the logic with the Stream API, because the traditional for-loop is simple enough and the Java 8 Stream API can't make it simpler:
int highestSeqNum = 0;
long sumOfDuration = 0;
byte highestCauseForOutput = 0; // compute it even if it may not be used; there is no performance cost
for (CallRecord currentRecord : completeCallRecords) {
  highestSeqNum = Math.max(highestSeqNum, currentRecord.getSeqNum());
  sumOfDuration += currentRecord.getDuration();
  highestCauseForOutput = (byte) Math.max(highestCauseForOutput, currentRecord.getCauseForOutput());
}
// Do something with or without highestCauseForOutput.

Scala: call by name and call by value performance

Consider the following code:
class A(val name: String)
Compare the two wrappers of A:
class B(a: A) {
  def name: String = a.name
}
and
class B1(a: A) {
  val name: String = a.name
}
B and B1 have the same functionality. How do they compare in memory and computation efficiency? Will the Scala compiler treat them as the same thing?
First, I'll say that micro-optimization questions are usually tricky to answer without context. Additionally, this question has nothing to do with call by name or call by value, as none of your examples use call by name.
Now, let's compile your code with scalac -Xprint:typer and see what gets emitted:
class B extends scala.AnyRef {
<paramaccessor> private[this] val a: A = _;
def <init>(a: A): B = {
B.super.<init>();
()
};
def name: String = B.this.a.name
};
class B1 extends scala.AnyRef {
<paramaccessor> private[this] val a: A = _;
def <init>(a: A): B1 = {
B1.super.<init>();
()
};
private[this] val name: String = B1.this.a.name;
<stable> <accessor> def name: String = B1.this.name
};
In class B, we hold a reference to a, and have a method name which calls the name value on A.
In class B1, we store name locally, since it is a value of B1 directly, not a method. By definition, val declarations in Scala have a method generated for them, and that is how they're accessed.
This boils down to the fact that B1 holds an additional reference to the name string allocated by A. Is this significant in any way from a performance perspective? I don't know. It looks negligible to me given the general question you've posted, but I wouldn't be too bothered by this unless you've profiled your application and found it to be a bottleneck.
Let's take this one step further, and run a simple JMH micro benchmark on this:
[info] Benchmark Mode Cnt Score Error Units
[info] MicroBenchClasses.testB1Access thrpt 50 296.291 ± 20.787 ops/us
[info] MicroBenchClasses.testBAccess thrpt 50 303.866 ± 5.435 ops/us
[info] MicroBenchClasses.testB1Access avgt 9 0.004 ± 0.001 us/op
[info] MicroBenchClasses.testBAccess avgt 9 0.003 ± 0.001 us/op
We see that call times are identical, since in both cases we're invoking a method. One thing we can notice is that the throughput on B is higher; why is that? Let's look at the bytecode:
B:
public java.lang.String name();
Code:
0: aload_0
1: getfield #20 // Field a:Lcom/testing/SimpleTryExample$A$1;
4: invokevirtual #22 // Method com/testing/SimpleTryExample$A$1.name:()Ljava/lang/String;
7: areturn
B1:
public java.lang.String name();
Code:
0: aload_0
1: getfield #19 // Field name:Ljava/lang/String;
4: areturn
It isn't trivial to understand why a getfield would be slower than an invokevirtual, but in the end the JIT may inline the getter call to name. This goes to show that you should take nothing for granted; benchmark everything.
Code for test:
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

/**
 * Created by Yuval.Itzchakov on 19/10/2017.
 */
@State(Scope.Thread)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 3)
@BenchmarkMode(Array(Mode.AverageTime, Mode.Throughput))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
class MicroBenchClasses {
  class A(val name: String)

  class B(a: A) {
    def name: String = a.name
  }

  class B1(a: A) {
    val name: String = a.name
  }

  var b: B = _
  var b1: B1 = _

  @Setup
  def setup() = {
    val firstA = new A("yuval")
    val secondA = new A("yuval")
    b = new B(firstA)
    b1 = new B1(secondA)
  }

  @Benchmark
  def testBAccess(): String = {
    b.name
  }

  @Benchmark
  def testB1Access(): String = {
    b1.name
  }
}

Does Inheritance in implicit value classes introduce an overhead?

I want to apply scala's value classes to one of my projects because they enable me to enrich certain primitive types without great overhead (I hope) and stay type-safe.
object Position {
  implicit class Pos(val i: Int) extends AnyVal with Ordered[Pos] {
    def +(p: Pos): Pos = i + p.i
    def -(p: Pos): Pos = if (i - p.i < 0) 0 else i - p.i
    def compare(p: Pos): Int = i - p.i
  }
}
My question: will the inheritance of Ordered force the allocation of Pos objects whenever I use them (thereby introducing great overhead), or not? If so, is there a way to circumvent this?
Every time Pos is treated as an Ordered[Pos], allocation will happen.
There are several cases when allocation has to happen; see http://docs.scala-lang.org/overviews/core/value-classes.html#when_allocation_is_necessary.
So when doing something as simple as calling <, you will get allocations:
val x = Pos( 1 )
val y = Pos( 2 )
x < y // x & y promoted to an actual instance (allocation)
The relevant rules are (quoted from the above article):
Whenever a value class is treated as another type, including a universal trait, an instance of the actual value class must be instantiated
and:
Another instance of this rule is when a value class is used as a type argument.
Disassembling the above code snippet confirms this:
0: aload_0
1: iconst_1
2: invokevirtual #21 // Method Pos:(I)I
5: istore_1
6: aload_0
7: iconst_2
8: invokevirtual #21 // Method Pos:(I)I
11: istore_2
12: new #23 // class test/Position$Pos
15: dup
16: iload_1
17: invokespecial #26 // Method test/Position$Pos."<init>":(I)V
20: new #23 // class test/Position$Pos
23: dup
24: iload_2
25: invokespecial #26 // Method test/Position$Pos."<init>":(I)V
28: invokeinterface #32, 2 // InterfaceMethod scala/math/Ordered.$less:(Ljava/lang/Object;)Z
As can be seen, we have two instances of the new opcode for class Position$Pos.
UPDATE: to avoid the allocation in simple cases like this, you can manually override each method (even if they only forward to the original implementation):
override def < (that: Pos): Boolean = super.<(that)
override def > (that: Pos): Boolean = super.>(that)
override def <= (that: Pos): Boolean = super.<=(that)
override def >= (that: Pos): Boolean = super.>=(that)
This will remove the allocation when doing x < y, for example.
However, this still leaves the cases when Pos is treated as an Ordered[Pos] (as when passed to a method taking an Ordered[Pos], or an Ordered[T] with T being a type parameter). In this particular case, you will still get an allocation, and there is no way around that.
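To see that last case in code, here is a small sketch (the describe helper and the AllocationDemo object are hypothetical, purely for illustration):

object AllocationDemo {
  import Position.Pos

  // The parameter type is the universal-trait view of Pos, so every call
  // site must box its argument into a real Pos instance.
  def describe(o: Ordered[Pos]): String = o.toString

  def main(args: Array[String]): Unit = {
    val p: Pos = Pos(3)  // stays an unboxed Int in simple uses
    println(describe(p)) // here p is promoted: one allocation per call
  }
}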

Is this a bug in Scala 2.9.1 lazy implementation or just an artifact of decompilation

I am considering using Scala on a pretty computationally intensive program. Profiling the C++ version of our code reveals that we could benefit significantly from lazy evaluation. I have tried it out in Scala 2.9.1 and really like it. However, when I ran the class through a decompiler, the implementation didn't look quite right. I'm assuming that it's an artifact of the decompiler, but I wanted to get a more conclusive answer...
Consider the following trivial example:
class TrivialAngle(radians: Double) {
  lazy val sin = math.sin(radians)
}
When I decompile it, I get this:
import scala.ScalaObject;
import scala.math.package.;
import scala.reflect.ScalaSignature;
#ScalaSignature(bytes="omitted")
public class TrivialAngle
implements ScalaObject
{
private final double radians;
private double sin;
public volatile int bitmap$0;
public double sin()
{
if ((this.bitmap$0 & 0x1) == 0);
synchronized (this)
{
if (
(this.bitmap$0 & 0x1) == 0)
{
this.sin = package..MODULE$.sin(this.radians);
this.bitmap$0 |= 1;
}
return this.sin;
}
}
public TrivialAngle(double radians)
{
}
}
To me, the return block is in the wrong spot, and you would always acquire the lock. This can't be what the real code is doing, but I am unable to confirm it. Can anyone confirm or deny that I have a bogus decompilation, and that the lazy implementation is somewhat reasonable (i.e., it only locks when it is computing the value, and doesn't acquire the lock for subsequent calls)?
Thanks!
For reference, this is the decompiler I used:
http://java.decompiler.free.fr/?q=jdgui
scala -Xprint:jvm reveals the true story:
[[syntax trees at end of jvm]]// Scala source: lazy.scala
package <empty> {
class TrivialAngle extends java.lang.Object with ScalaObject {
#volatile protected var bitmap$0: Int = 0;
<paramaccessor> private[this] val radians: Double = _;
lazy private[this] var sin: Double = _;
<stable> <accessor> lazy def sin(): Double = {
if (TrivialAngle.this.bitmap$0.&(1).==(0))
{
TrivialAngle.this.synchronized({
if (TrivialAngle.this.bitmap$0.&(1).==(0))
{
TrivialAngle.this.sin = scala.math.`package`.sin(TrivialAngle.this.radians);
TrivialAngle.this.bitmap$0 = TrivialAngle.this.bitmap$0.|(1);
()
};
scala.runtime.BoxedUnit.UNIT
});
()
};
TrivialAngle.this.sin
};
def this(radians: Double): TrivialAngle = {
TrivialAngle.this.radians = radians;
TrivialAngle.super.this();
()
}
}
}
It's a safe (since JVM 1.5) and very fast double-checked lock.
More details:
What's the (hidden) cost of Scala's lazy val?
Be aware that if you have multiple lazy val members in a class, only one of them can be initialized at once, as they are guarded by synchronized(this) { ... }.
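That shared monitor is one of the hazards the SIP-20 proposal mentioned earlier aims to address. As a hedged sketch, adapted from the style of examples motivating the SIP: two objects whose lazy initializers reach into each other can deadlock, because each initializer runs while holding its owner's monitor.

object A {
  lazy val a0 = B.b // initializing a0 locks A, then needs B's monitor
  lazy val a1 = 17
}

object B {
  lazy val b = A.a1 // initializing b locks B, then needs A's monitor
}

// If thread 1 touches A.a0 while thread 2 touches B.b, each holds its own
// object's lock and waits forever for the other's: a deadlock.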
What I get with javap -c does not correspond to your decompilation. In particular, there is no monitor enter when the field is found to be initialized. Version 2.9.1 too. There is still the memory barrier implied by the volatile access of course, so it does not come completely free. Comments starting with /// are mine.
public double sin();
Code:
0: aload_0
1: getfield #14; //Field bitmap$0:I
4: iconst_1
5: iand
6: iconst_0
7: if_icmpne 54 /// if bitmap$0 & 1 != 0 goto 54: already initialized, skip lock
10: aload_0
11: dup
12: astore_1
13: monitorenter
/// 14 to 52 reasonably equivalent to synchronized block
/// in your decompiled code, without the return
53: monitorexit
54: aload_0
55: getfield #27; //Field sin:D
58: dreturn /// return outside lock
59: aload_1 /// (this would be the finally implied by the lock)
60: monitorexit
61: athrow
Exception table:
from to target type
14 54 59 any

Scala stateful actor, recursive calling faster than using vars?

Sample code below. I'm a little curious why MyActor is faster than MyActor2. MyActor recursively calls process/react and keeps state in the function parameters whereas MyActor2 keeps state in vars. MyActor even has the extra overhead of tupling the state but still runs faster. I'm wondering if there is a good explanation for this or if maybe I'm doing something "wrong".
I realize the performance difference is not significant but the fact that it is there and consistent makes me curious what's going on here.
Ignoring the first two runs as warmup, I get:
MyActor:
559
511
544
529
vs.
MyActor2:
647
613
654
610
import scala.actors._

object Const {
  val NUM = 100000
  val NM1 = NUM - 1
}

trait Send[MessageType] {
  def send(msg: MessageType)
}

// Test 1 using recursive calls to maintain state
abstract class StatefulTypedActor[MessageType, StateType](val initialState: StateType) extends Actor with Send[MessageType] {
  def process(state: StateType, message: MessageType): StateType

  def act = proc(initialState)

  def send(message: MessageType) = {
    this ! message
  }

  private def proc(state: StateType) {
    react {
      case msg: MessageType => proc(process(state, msg))
    }
  }
}

object MyActor extends StatefulTypedActor[Int, (Int, Long)]((0, 0)) {
  override def process(state: (Int, Long), input: Int) = input match {
    case 0 =>
      (1, System.currentTimeMillis())
    case input: Int =>
      state match {
        case (Const.NM1, start) =>
          println((System.currentTimeMillis() - start))
          (Const.NUM, start)
        case (s, start) =>
          (s + 1, start)
      }
  }
}

// Test 2 using vars to maintain state
object MyActor2 extends Actor with Send[Int] {
  private var state = 0
  private var strt = 0: Long

  def send(message: Int) = {
    this ! message
  }

  def act =
    loop {
      react {
        case 0 =>
          state = 1
          strt = System.currentTimeMillis()
        case input: Int =>
          state match {
            case Const.NM1 =>
              println((System.currentTimeMillis() - strt))
              state += 1
            case s =>
              state += 1
          }
      }
    }
}

// main: Run testing
object TestActors {
  def main(args: Array[String]): Unit = {
    val a = MyActor
    // val a = MyActor2
    a.start()
    testIt(a)
  }

  def testIt(a: Send[Int]) {
    for (_ <- 0 to 5) {
      for (i <- 0 to Const.NUM) {
        a send i
      }
    }
  }
}
EDIT: Based on Vasil's response, I removed the loop and tried it again. And then MyActor2, based on vars, leapfrogged and is now around 10% or so faster. So the lesson is: if you are confident that you won't end up with a stack-overflowing backlog of messages, and you care to squeeze every little bit of performance out, don't use loop; just call the act() method recursively.
Change for MyActor2:
override def act() =
  react {
    case 0 =>
      state = 1
      strt = System.currentTimeMillis()
      act()
    case input: Int =>
      state match {
        case Const.NM1 =>
          println((System.currentTimeMillis() - strt))
          state += 1
        case s =>
          state += 1
      }
      act()
  }
Such results are caused by the specifics of your benchmark (a lot of small messages that fill the actor's mailbox quicker than it can handle them).
Generally, the workflow of react is the following:
Actor scans the mailbox;
If it finds a message, it schedules the execution;
When the scheduling completes, or when there are no messages in the mailbox, the actor suspends (Actor.suspendException is thrown);
In the first case, when the handler finishes processing the message, execution proceeds straight to the react method, and, as long as there are lots of messages in the mailbox, the actor immediately schedules the next message to execute, and only after that suspends.
In the second case, loop schedules the execution of react in order to prevent a stack overflow (which might be your case with Actor #1, because tail recursion in process is not optimized), and thus, execution doesn't proceed to react immediately, as in the first case. That's where the millis are lost.
UPDATE (taken from here):
Using loop instead of recursive react effectively doubles the number of tasks that the thread pool has to execute in order to accomplish the same amount of work, which in turn makes it so any overhead in the scheduler is far more pronounced when using loop.
Just a wild stab in the dark: it might be due to the exception thrown by react in order to escape the loop. Exception creation is quite heavy. However, I don't know how often it does that, but it should be possible to check with a catch and a counter.
The overhead in your test depends heavily on the number of threads that are present (try using only one thread with scala -Dactors.corePoolSize=1!). I'm finding it difficult to figure out exactly where the difference arises; the only real difference is that in one case you use loop and in the other you do not. loop does do a fair bit of work, since it repeatedly creates function objects using "andThen" rather than iterating. I'm not sure whether this is enough to explain the difference, especially in light of the heavy usage by scala.actors.Scheduler$.impl and ExceptionBlob.
