Scala: call by name and call by value performance

Consider the following code:
class A(val name: String)
Compare the two wrappers of A:
class B(a: A) {
  def name: String = a.name
}
and
class B1(a: A) {
  val name: String = a.name
}
B and B1 have the same functionality. How do they compare in memory and computation efficiency? Will the Scala compiler treat them as the same thing?

First, I'll say that micro-optimization questions are usually tricky to answer given the lack of context. Additionally, this question has nothing to do with call by name or call by value, as none of your examples use call by name.
Now, let's compile your code with scalac -Xprint:typer and see what gets emitted:
class B extends scala.AnyRef {
  <paramaccessor> private[this] val a: A = _;
  def <init>(a: A): B = {
    B.super.<init>();
    ()
  };
  def name: String = B.this.a.name
};
class B1 extends scala.AnyRef {
  <paramaccessor> private[this] val a: A = _;
  def <init>(a: A): B1 = {
    B1.super.<init>();
    ()
  };
  private[this] val name: String = B1.this.a.name;
  <stable> <accessor> def name: String = B1.this.name
};
In class B, we hold a reference to a, and have a method name which calls the name value on A.
In class B1, we store name locally, since it is a value of B1 directly, not a method. By definition, val declarations have a method generated for them, and that is how they're accessed.
This boils down to the fact that B1 holds an additional reference to the name string allocated by A. Is this significant in any way from a performance perspective? I don't know. It looks negligible to me given the general question you've posted, and I wouldn't be too bothered by it unless you've profiled your application and found this to be a bottleneck.
Let's take this one step further and run a simple JMH micro-benchmark on this:
[info] Benchmark Mode Cnt Score Error Units
[info] MicroBenchClasses.testB1Access thrpt 50 296.291 ± 20.787 ops/us
[info] MicroBenchClasses.testBAccess thrpt 50 303.866 ± 5.435 ops/us
[info] MicroBenchClasses.testB1Access avgt 9 0.004 ± 0.001 us/op
[info] MicroBenchClasses.testBAccess avgt 9 0.003 ± 0.001 us/op
We see that call times are identical, since in both cases we're invoking a method. One thing we can notice is that the throughput on B is higher. Why is that? Let's look at the byte code:
B:
public java.lang.String name();
Code:
0: aload_0
1: getfield #20 // Field a:Lcom/testing/SimpleTryExample$A$1;
4: invokevirtual #22 // Method com/testing/SimpleTryExample$A$1.name:()Ljava/lang/String;
7: areturn
B1:
public java.lang.String name();
Code:
0: aload_0
1: getfield #19 // Field name:Ljava/lang/String;
4: areturn
It isn't trivial to understand why a getfield would be slower than an invokevirtual, but in the end the JIT may inline the getter call to name. This goes to show that you should take nothing for granted; benchmark everything.
Code for test:
import java.util.concurrent.TimeUnit
import org.openjdk.jmh.annotations._

/**
  * Created by Yuval.Itzchakov on 19/10/2017.
  */
@State(Scope.Thread)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 3)
@BenchmarkMode(Array(Mode.AverageTime, Mode.Throughput))
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
class MicroBenchClasses {
  class A(val name: String)

  class B(a: A) {
    def name: String = a.name
  }

  class B1(a: A) {
    val name: String = a.name
  }

  var b: B = _
  var b1: B1 = _

  @Setup
  def setup() = {
    val firstA = new A("yuval")
    val secondA = new A("yuval")
    b = new B(firstA)
    b1 = new B1(secondA)
  }

  @Benchmark
  def testBAccess(): String = {
    b.name
  }

  @Benchmark
  def testB1Access(): String = {
    b1.name
  }
}

Related

Using par map to increase performance

The code below runs a comparison of users and writes to a file. I've removed some code to make it as concise as possible, but speed is an issue in this code as well:
import scala.collection.JavaConversions._

object writedata {
  def getDistance(str1: String, str2: String) = {
    val zipped = str1.zip(str2)
    val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2
    val p = zipped.count(_ == ('1', '1')).toFloat * 2
    val q = zipped.count(_ == ('1', '0')).toFloat * 2
    val r = zipped.count(_ == ('0', '1')).toFloat * 2
    val s = zipped.count(_ == ('0', '0')).toFloat * 2
    (q + r) / (p + q + r)
  } //> getDistance: (str1: String, str2: String)Float

  case class UserObj(id: String, nCoordinate: String)

  val userList = new java.util.ArrayList[UserObj] //> userList : java.util.ArrayList[writedata.UserObj] = []
  for (a <- 1 to 100) {
    userList.add(new UserObj("2", "101010"))
  }

  def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
    try { f(param) } finally { param.close() } //> using: [A <: AnyRef{def close(): Unit}, B](param: A)(f: A => B)B

  def appendToFile(fileName: String, textData: String) =
    using(new java.io.FileWriter(fileName, true)) { fileWriter =>
      using(new java.io.PrintWriter(fileWriter)) { printWriter =>
        printWriter.println(textData)
      }
    } //> appendToFile: (fileName: String, textData: String)Unit

  var counter = 0 //> counter : Int = 0
  for (xUser <- userList.par) {
    userList.par.map(yUser => {
      if (!xUser.id.isEmpty && !yUser.id.isEmpty)
        synchronized {
          appendToFile("c:\\data-files\\test.txt", getDistance(xUser.nCoordinate, yUser.nCoordinate).toString)
        }
    })
  }
}
The above code was previously an imperative solution, so the .par functionality was within an inner and outer loop. I'm attempting to convert it to a more functional implementation while also taking advantage of Scala's parallel collections framework.
In this example the data set size is 10, but in the code I'm working on the size is 8000, which translates to 64,000,000 comparisons. I'm using a synchronized block so that multiple threads are not writing to the same file at the same time. A performance improvement I'm considering is populating a separate collection within the inner loop (userList.par.map(yUser => {) and then writing that collection out to a separate file.
Are there other methods I can use to improve performance, so that I can handle a list that contains 8000 items instead of the above example of 100?
I'm not sure if you removed too much code for clarity, but from what I can see, there is absolutely nothing that can run in parallel, since the only thing you are doing is writing to a file.
EDIT:
One thing that you should do is move the getDistance(...) computation before the synchronized call to appendToFile; otherwise your parallelized code ends up being sequential.
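As a rough sketch of that change (reusing the names from the question, and using foreach since the result of map is discarded), only the file write stays under the lock:
for (xUser <- userList.par) {
  userList.par.foreach(yUser => {
    if (!xUser.id.isEmpty && !yUser.id.isEmpty) {
      // computed in parallel, outside the lock
      val distance = getDistance(xUser.nCoordinate, yUser.nCoordinate).toString
      synchronized {
        appendToFile("c:\\data-files\\test.txt", distance)
      }
    }
  })
}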
Instead of calling a synchronized appendToFile, I would call appendToFile in a non-synchronized way, but have each call to that method add the new line to some synchronized queue. Then I would have another thread that flushes that queue to disk periodically. But then you would also need to add something to make sure that the queue is also flushed when all computations are done. So that could get complicated...
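A minimal sketch of that queue idea (the names and the sentinel protocol here are mine, not from the question):
import java.util.concurrent.LinkedBlockingQueue

object QueuedWriter {
  private val Poison = new String("<EOF>") // sentinel object used to stop the writer
  private val queue = new LinkedBlockingQueue[String]()

  // workers call this instead of a synchronized appendToFile
  def append(textData: String): Unit = queue.put(textData)

  // call once when all computations are done, so the queue is fully flushed
  def finish(): Unit = queue.put(Poison)

  private val writerThread = new Thread(new Runnable {
    def run(): Unit = {
      val out = new java.io.PrintWriter(new java.io.FileWriter("c:\\data-files\\test.txt", true))
      try {
        var line = queue.take()
        while (line ne Poison) { // reference equality: only the sentinel stops the loop
          out.println(line)
          line = queue.take()
        }
      } finally out.close()
    }
  })
  writerThread.start()
}
The single writer thread drains the queue continuously; the finish() call is the "make sure the queue is also flushed when all computations are done" part mentioned above.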
Alternatively, you could also keep your code and simply drop the synchronization around the call to appendToFile. It seems that println itself is synchronized. However, that would be risky since println is not officially synchronized and it could change in future versions.

Accessing position information in a Scala combinator parser kills performance

I wrote a new combinator for my parser in Scala.
It's a variation of the ^^ combinator, which passes position information on.
But accessing the position information of the input element really costs performance.
In my case, parsing a big example needs around 3 seconds without position information; with it, it needs over 30 seconds.
I wrote a runnable example where the runtime is about 50% more when accessing the position.
Why is that? How can I get a better runtime?
Example:
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.combinator.Parsers
import scala.util.matching.Regex
import scala.language.implicitConversions

object FooParser extends RegexParsers with Parsers {
  var withPosInfo = false

  def b: Parser[String] = regexB("""[a-z]+""".r) ^^# { case (b, x) => b + " ::" + x.toString }

  def regexB(p: Regex): BParser[String] = new BParser(regex(p))

  class BParser[T](p: Parser[T]) {
    def ^^#[U](f: ((Int, Int), T) => U): Parser[U] = Parser { in =>
      val source = in.source
      val offset = in.offset
      val start = handleWhiteSpace(source, offset)
      val inwo = in.drop(start - offset)
      p(inwo) match {
        case Success(t, in1) => {
          var a = 3
          var b = 4
          if (withPosInfo) { // takes a lot of time
            a = inwo.pos.line
            b = inwo.pos.column
          }
          Success(f((a, b), t), in1)
        }
        case ns: NoSuccess => ns
      }
    }
  }

  def main(args: Array[String]) = {
    val r = "foo" * 50000000
    var now = System.nanoTime
    parseAll(b, r)
    var us = (System.nanoTime - now) / 1000
    println("without: %d us".format(us))
    withPosInfo = true
    now = System.nanoTime
    parseAll(b, r)
    us = (System.nanoTime - now) / 1000
    println("with : %d us".format(us))
  }
}
Output:
without: 2952496 us
with : 4591070 us
Unfortunately, I don't think you can use the same approach. The problem is that line numbers end up implemented by scala.util.parsing.input.OffsetPosition, which builds a list of every line break every time it is created. So if it ends up with string input, it will parse the entire thing on every call to pos (twice in your example). See the code for CharSequenceReader and OffsetPosition for more details.
There is one quick thing you can do to speed this up:
val ip = inwo.pos
a = ip.line
b = ip.column
to at least avoid creating pos twice. But that still leaves you with a lot of redundant work. I'm afraid that to really solve the problem you'll have to build the index as in OffsetPosition yourself, just once, and then keep referring to it.
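If you do build the index yourself, a minimal sketch might look like this (LineIndex and its members are my own names, not part of the parser library):
class LineIndex(source: CharSequence) {
  // offsets at which each line starts, computed a single time
  private val starts: Array[Int] = {
    val buf = scala.collection.mutable.ArrayBuffer(0)
    var i = 0
    while (i < source.length) {
      if (source.charAt(i) == '\n') buf += (i + 1)
      i += 1
    }
    buf.toArray
  }

  // 1-based (line, column) for a given offset, via binary search over starts
  def lineCol(offset: Int): (Int, Int) = {
    var lo = 0
    var hi = starts.length - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) >>> 1
      if (starts(mid) <= offset) lo = mid else hi = mid - 1
    }
    (lo + 1, offset - starts(lo) + 1)
  }
}
Built once per input, the parser could then call lineCol(start) instead of inwo.pos.line and inwo.pos.column.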
You could also file a bug report / make an enhancement request. This is not a very good way to implement the feature.

Is this a bug in Scala 2.9.1 lazy implementation or just an artifact of decompilation

I am considering using Scala on a pretty computationally intensive program. Profiling the C++ version of our code reveals that we could benefit significantly from lazy evaluation. I have tried it out in Scala 2.9.1 and really like it. However, when I ran the class through a decompiler the implementation didn't look quite right. I'm assuming that it's an artifact of the decompiler, but I wanted to get a more conclusive answer...
consider the following trivial example:
class TrivialAngle(radians: Double) {
  lazy val sin = math.sin(radians)
}
when I decompile it, I get this:
import scala.ScalaObject;
import scala.math.package.;
import scala.reflect.ScalaSignature;

@ScalaSignature(bytes="omitted")
public class TrivialAngle
  implements ScalaObject
{
  private final double radians;
  private double sin;
  public volatile int bitmap$0;

  public double sin()
  {
    if ((this.bitmap$0 & 0x1) == 0);
    synchronized (this)
    {
      if (
        (this.bitmap$0 & 0x1) == 0)
      {
        this.sin = package..MODULE$.sin(this.radians);
        this.bitmap$0 |= 1;
      }
      return this.sin;
    }
  }

  public TrivialAngle(double radians)
  {
  }
}
To me, the return block is in the wrong spot, and you will always acquire the lock. This can't be what the real code is doing, but I am unable to confirm this. Can anyone confirm or deny that I have a bogus decompilation, and that the lazy implementation is somewhat reasonable (i.e., only locks while it is computing the value, and doesn't acquire the lock for subsequent calls)?
Thanks!
For reference, this is the decompiler I used:
http://java.decompiler.free.fr/?q=jdgui
scala -Xprint:jvm reveals the true story:
[[syntax trees at end of jvm]] // Scala source: lazy.scala
package <empty> {
  class TrivialAngle extends java.lang.Object with ScalaObject {
    @volatile protected var bitmap$0: Int = 0;
    <paramaccessor> private[this] val radians: Double = _;
    lazy private[this] var sin: Double = _;
    <stable> <accessor> lazy def sin(): Double = {
      if (TrivialAngle.this.bitmap$0.&(1).==(0))
        {
          TrivialAngle.this.synchronized({
            if (TrivialAngle.this.bitmap$0.&(1).==(0))
              {
                TrivialAngle.this.sin = scala.math.`package`.sin(TrivialAngle.this.radians);
                TrivialAngle.this.bitmap$0 = TrivialAngle.this.bitmap$0.|(1);
                ()
              };
            scala.runtime.BoxedUnit.UNIT
          });
          ()
        };
      TrivialAngle.this.sin
    };
    def this(radians: Double): TrivialAngle = {
      TrivialAngle.this.radians = radians;
      TrivialAngle.super.this();
      ()
    }
  }
}
It's a safe (since JVM 1.5) and very fast double-checked lock.
More details:
What's the (hidden) cost of Scala's lazy val?
Be aware that if you have multiple lazy val members in a class, only one of them can be initialized at a time, as they are guarded by synchronized(this) { ... }.
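A tiny illustration of that serialization (the timings are only indicative):
class TwoLazies {
  lazy val a = { Thread.sleep(1000); 1 } // the initializer holds the lock on `this`
  lazy val b = { Thread.sleep(1000); 2 } // a thread initializing `b` must wait for `a` to finish
}
Two threads touching a and b concurrently take roughly two seconds rather than one, because both initializers synchronize on the same object.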
What I get with javap -c does not correspond to your decompilation. In particular, there is no monitorenter when the field is found to be initialized. This is version 2.9.1 too. There is still the memory barrier implied by the volatile access, of course, so it does not come completely free. Comments starting with /// are mine:
public double sin();
Code:
0: aload_0
1: getfield #14; //Field bitmap$0:I
4: iconst_1
5: iand
6: iconst_0
7: if_icmpne 54 /// if getfield & 1 == 0 goto 54, skip lock
10: aload_0
11: dup
12: astore_1
13: monitorenter
/// 14 to 52 reasonably equivalent to synchronized block
/// in your decompiled code, without the return
53: monitorexit
54: aload_0
55: getfield #27; //Field sin:D
58: dreturn /// return outside lock
59: aload_1 /// (this would be the finally implied by the lock)
60: monitorexit
61: athrow
Exception table:
from to target type
14 54 59 any

What is the performance impact of Scala implicit type conversions?

In Scala, is there a significant CPU or memory impact to using implicit type conversions to augment a class's functionality vs. other possible implementation choices?
For example, consider a silly String manipulation function. This implementation uses string concatenation:
object Funky {
  def main(args: Array[String]) {
    args foreach (arg => println("Funky " + arg))
  }
}
This implementation hides the concatenation behind a member method by using an implicit type conversion:
class FunkyString(str: String) {
  def funkify() = "Funky " + str
}

object ImplicitFunky {
  implicit def asFunkyString(str: String) = new FunkyString(str)

  def main(args: Array[String]) {
    args foreach (arg => println(arg.funkify()))
  }
}
Both do the same thing:
scala> Funky.main(Array("Cold Medina", "Town", "Drummer"))
Funky Cold Medina
Funky Town
Funky Drummer
scala> ImplicitFunky.main(Array("Cold Medina", "Town", "Drummer"))
Funky Cold Medina
Funky Town
Funky Drummer
Is there any performance difference? A few specific considerations:
Does Scala inline the implicit calls to the asFunkyString method?
Does Scala actually create a new wrapper FunkyString object for each arg, or can it optimize away the extra object allocations?
Suppose FunkyString had 3 different methods (funkify1, funkify2, and funkify3), and the body of foreach called each one in succession:
println(arg.funkify1())
println(arg.funkify2())
println(arg.funkify3())
Would Scala repeat the conversion 3 times, or would it optimize away the redundant conversions and just do it once for each loop iteration?
Suppose instead that I explicitly capture the conversion in another variable, like this:
val fs = asFunkyString(arg)
println(fs.funkify1())
println(fs.funkify2())
println(fs.funkify3())
Does that change the situation?
In practical terms, is broad usage of implicit conversions a potential performance issue, or is it typically harmless?
I tried to set up a microbenchmark using the excellent Scala-Benchmark-Template.
It is very difficult to write a meaningful benchmark (one not optimized away by the JIT) that tests just the implicit conversions, so I had to add a bit of overhead.
Here is the code:
class FunkyBench extends SimpleScalaBenchmark {
  val N = 10000

  def timeDirect(reps: Int) = repeat(reps) {
    var strs = List[String]()
    var s = "a"
    for (i <- 0 until N) {
      s += "a"
      strs ::= "Funky " + s
    }
    strs
  }

  def timeImplicit(reps: Int) = repeat(reps) {
    import Funky._
    var strs = List[String]()
    var s = "a"
    for (i <- 0 until N) {
      s += "a"
      strs ::= s.funkify
    }
    strs
  }
}
And here are the results:
[info] benchmark ms linear runtime
[info] Direct 308 =============================
[info] Implicit 309 ==============================
My conclusion: in any non-trivial piece of code, the impact of implicit conversions (object creation) is not measurable.
EDIT: I used scala 2.9.0 and java 1.6.0_24 (in server mode)
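For reference, the benchmark imports Funky._, which isn't shown above; presumably it just packages the implicit conversion from the question in an object, roughly like this (my reconstruction, not the original code):
object Funky {
  class FunkyString(str: String) {
    def funkify() = "Funky " + str
  }
  implicit def asFunkyString(str: String) = new FunkyString(str)
}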
The JVM can optimize away the extra object allocations if it detects that doing so would be worthwhile.
This is important, because if you just inline things you end up with bigger methods, which might cause performance problems with the cache or even decrease the chance of the JVM applying other optimizations.

What's the (hidden) cost of Scala's lazy val?

One handy feature of Scala is lazy val, where the evaluation of a val is delayed until it's necessary (at first access).
Of course, a lazy val must have some overhead - somewhere Scala must keep track of whether the value has already been evaluated and the evaluation must be synchronized, because multiple threads might try to access the value for the first time at the same time.
What exactly is the cost of a lazy val - is there a hidden boolean flag associated with a lazy val to keep track if it has been evaluated or not, what exactly is synchronized and are there any more costs?
In addition, suppose I do this:
class Something {
  lazy val (x, y) = { ... }
}
Is this the same as having two separate lazy vals x and y or do I get the overhead only once, for the pair (x, y)?
This is taken from the scala mailing list and gives implementation details of lazy in terms of Java code (rather than bytecode):
class LazyTest {
  lazy val msg = "Lazy"
}
is compiled to something equivalent to the following Java code:
class LazyTest {
  public int bitmap$0;
  private String msg;

  public String msg() {
    if ((bitmap$0 & 1) == 0) {
      synchronized (this) {
        if ((bitmap$0 & 1) == 0) {
          synchronized (this) {
            msg = "Lazy";
          }
        }
        bitmap$0 = bitmap$0 | 1;
      }
    }
    return msg;
  }
}
It looks like the compiler arranges for a class-level bitmap int field to flag multiple lazy fields as initialized (or not), and initializes the target field in a synchronized block if the relevant bit of the bitmap indicates it is necessary.
Using:
class Something {
  lazy val foo = getFoo
  def getFoo = "foo!"
}
produces sample bytecode:
0 aload_0 [this]
1 getfield blevins.example.Something.bitmap$0 : int [15]
4 iconst_1
5 iand
6 iconst_0
7 if_icmpne 48
10 aload_0 [this]
11 dup
12 astore_1
13 monitorenter
14 aload_0 [this]
15 getfield blevins.example.Something.bitmap$0 : int [15]
18 iconst_1
19 iand
20 iconst_0
21 if_icmpne 42
24 aload_0 [this]
25 aload_0 [this]
26 invokevirtual blevins.example.Something.getFoo() : java.lang.String [18]
29 putfield blevins.example.Something.foo : java.lang.String [20]
32 aload_0 [this]
33 aload_0 [this]
34 getfield blevins.example.Something.bitmap$0 : int [15]
37 iconst_1
38 ior
39 putfield blevins.example.Something.bitmap$0 : int [15]
42 getstatic scala.runtime.BoxedUnit.UNIT : scala.runtime.BoxedUnit [26]
45 pop
46 aload_1
47 monitorexit
48 aload_0 [this]
49 getfield blevins.example.Something.foo : java.lang.String [20]
52 areturn
53 aload_1
54 monitorexit
55 athrow
Values initialized in tuples like lazy val (x,y) = { ... } have nested caching via the same mechanism. The tuple result is lazily evaluated and cached, and an access of either x or y will trigger the tuple evaluation. Extraction of the individual value from the tuple is done independently and lazily (and cached). So the above double-instantiation code generates an x, a y, and an x$1 field of type Tuple2.
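In other words, the compiler treats the tuple pattern roughly like this (my paraphrase of the desugaring: xy stands in for the synthetic x$1 field, and computeX/computeY are placeholders for whatever the original block computes):
class Something {
  def computeX: Int = 1 // placeholder
  def computeY: Int = 2 // placeholder
  // xy plays the role of the compiler's synthetic x$1 field
  private lazy val xy: (Int, Int) = (computeX, computeY) // evaluated on first access to x or y
  lazy val x: Int = xy._1 // extracted and cached independently
  lazy val y: Int = xy._2 // extracted and cached independently
}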
With Scala 2.10, a lazy value like:
class Example {
  lazy val x = "Value"
}
is compiled to byte code that resembles the following Java code:
public class Example {
  private String x;
  private volatile boolean bitmap$0;

  public String x() {
    if (this.bitmap$0 == true) {
      return this.x;
    } else {
      return x$lzycompute();
    }
  }

  private String x$lzycompute() {
    synchronized (this) {
      if (this.bitmap$0 != true) {
        this.x = "Value";
        this.bitmap$0 = true;
      }
      return this.x;
    }
  }
}
Note that the bitmap is represented by a boolean. If you add another field, the compiler will increase the size of the bitmap field so it can represent at least 2 values, i.e. to a byte. This just goes on for huge classes.
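For example (a sketch of the idea, not exact compiler output):
class Several {
  lazy val a = 1 // guarded by bit 0 of the bitmap field
  lazy val b = 2 // guarded by bit 1
  lazy val c = 3 // guarded by bit 2
}
One lazy val needs only a boolean; a few more need a byte-sized bitmap, and so on.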
But you might wonder why this works. The thread-local caches must be cleared when entering a synchronized block, so that the non-volatile x value is flushed into memory. This blog article gives an explanation.
Scala SIP-20 proposes a new implementation of lazy val, which is more correct but ~25% slower than the "current" version.
The proposed implementation looks like:
class LazyCellBase { // in a Java file - we need a public bitmap_0
  public static AtomicIntegerFieldUpdater<LazyCellBase> arfu_0 =
    AtomicIntegerFieldUpdater.newUpdater(LazyCellBase.class, "bitmap_0");
  public volatile int bitmap_0 = 0;
}

final class LazyCell extends LazyCellBase {
  import LazyCellBase._
  var value_0: Int = _

  @tailrec final def value(): Int = (arfu_0.get(this): @switch) match {
    case 0 =>
      if (arfu_0.compareAndSet(this, 0, 1)) {
        val result = 0
        value_0 = result

        @tailrec def complete(): Unit = (arfu_0.get(this): @switch) match {
          case 1 =>
            if (!arfu_0.compareAndSet(this, 1, 3)) complete()
          case 2 =>
            if (arfu_0.compareAndSet(this, 2, 3)) {
              synchronized { notifyAll() }
            } else complete()
        }

        complete()
        result
      } else value()
    case 1 =>
      arfu_0.compareAndSet(this, 1, 2)
      synchronized {
        while (arfu_0.get(this) != 3) wait()
      }
      value_0
    case 2 =>
      synchronized {
        while (arfu_0.get(this) != 3) wait()
      }
      value_0
    case 3 => value_0
  }
}
As of June 2013 this SIP hasn't been approved. I expect that it's likely to be approved and included in a future version of Scala based on the mailing list discussion. Consequently, I think you'd be wise to heed Daniel Spiewak's observation:
Lazy val is *not* free (or even cheap). Use it only if you absolutely
need laziness for correctness, not for optimization.
I've written a post about this issue: https://dzone.com/articles/cost-laziness
In a nutshell, the penalty is so small that in practice you can ignore it.
Given the bytecode generated by Scala for lazy, it can suffer from the thread-safety problems mentioned in this article on double-checked locking: http://www.javaworld.com/javaworld/jw-05-2001/jw-0525-double.html?page=1
