Scala stateful actor, recursive calling faster than using vars? - performance

Sample code below. I'm a little curious why MyActor is faster than MyActor2. MyActor recursively calls process/react and keeps state in the function parameters whereas MyActor2 keeps state in vars. MyActor even has the extra overhead of tupling the state but still runs faster. I'm wondering if there is a good explanation for this or if maybe I'm doing something "wrong".
I realize the performance difference is not significant but the fact that it is there and consistent makes me curious what's going on here.
Ignoring the first two runs as warmup, I get (times in milliseconds):
MyActor:
559
511
544
529
vs.
MyActor2:
647
613
654
610
import scala.actors._
object Const {
  val NUM = 100000
  val NM1 = NUM - 1
}
trait Send[MessageType] {
  def send(msg: MessageType)
}
// Test 1 using recursive calls to maintain state
abstract class StatefulTypedActor[MessageType, StateType](val initialState: StateType) extends Actor with Send[MessageType] {
  def process(state: StateType, message: MessageType): StateType
  def act = proc(initialState)
  def send(message: MessageType) = {
    this ! message
  }
  private def proc(state: StateType) {
    react {
      case msg: MessageType => proc(process(state, msg))
    }
  }
}
object MyActor extends StatefulTypedActor[Int, (Int, Long)]((0, 0)) {
  override def process(state: (Int, Long), input: Int) = input match {
    case 0 =>
      (1, System.currentTimeMillis())
    case input: Int =>
      state match {
        case (Const.NM1, start) =>
          println((System.currentTimeMillis() - start))
          (Const.NUM, start)
        case (s, start) =>
          (s + 1, start)
      }
  }
}
// Test 2 using vars to maintain state
object MyActor2 extends Actor with Send[Int] {
  private var state = 0
  private var strt = 0: Long
  def send(message: Int) = {
    this ! message
  }
  def act =
    loop {
      react {
        case 0 =>
          state = 1
          strt = System.currentTimeMillis()
        case input: Int =>
          state match {
            case Const.NM1 =>
              println((System.currentTimeMillis() - strt))
              state += 1
            case s =>
              state += 1
          }
      }
    }
}
// main: Run testing
object TestActors {
  def main(args: Array[String]): Unit = {
    val a = MyActor
    // val a = MyActor2
    a.start()
    testIt(a)
  }
  def testIt(a: Send[Int]) {
    for (_ <- 0 to 5) {
      for (i <- 0 to Const.NUM) {
        a send i
      }
    }
  }
}
EDIT: Based on Vasil's response, I removed the loop and tried again. MyActor2, based on vars, then leapfrogged MyActor and is now maybe 10% or so faster. So the lesson is: if you are confident that you won't end up with a stack-overflowing backlog of messages, and you want to squeeze out every bit of performance, don't use loop and just call the act() method recursively.
Change for MyActor2:
override def act() =
  react {
    case 0 =>
      state = 1
      strt = System.currentTimeMillis()
      act()
    case input: Int =>
      state match {
        case Const.NM1 =>
          println((System.currentTimeMillis() - strt))
          state += 1
        case s =>
          state += 1
      }
      act()
  }

These results come from the specifics of your benchmark: a lot of small messages that fill the actor's mailbox faster than the actor can handle them.
In general, react works as follows:
The actor scans the mailbox;
If it finds a message, it schedules the handler for execution;
When the scheduling completes, or when there are no messages in the mailbox, the actor suspends (Actor.suspendException is thrown).
In the first case (recursive react), when the handler finishes processing a message, execution proceeds straight to the react method, and, as long as there are plenty of messages in the mailbox, the actor immediately schedules the next message for execution and only then suspends.
In the second case, loop schedules the execution of react in order to prevent a stack overflow (which might happen with your Actor #1, because the tail recursion in proc is not optimized), so execution does not proceed to react immediately as it does in the first case. That's where the milliseconds are lost.
UPDATE (taken from here):
Using loop instead of recursive react effectively doubles the number of tasks that the thread pool has to execute in order to accomplish the same amount of work, which in turn makes it so any overhead in the scheduler is far more pronounced when using loop.

Just a wild stab in the dark. It might be due to the exception thrown by react in order to exit the loop. Exception creation is quite heavy. I don't know how often it does that, but it should be possible to check with a catch and a counter.

The overhead in your test depends heavily on the number of threads that are present (try using only one thread with scala -Dactors.corePoolSize=1!). I'm finding it difficult to figure out exactly where the difference arises; the only real difference is that in one case you use loop and in the other you do not. loop does do a fair bit of work, since it repeatedly creates function objects using "andThen" rather than iterating. I'm not sure whether this is enough to explain the difference, especially in light of the heavy usage by scala.actors.Scheduler$.impl and ExceptionBlob.

Related

Memory leak with Rayon and Indicatif in Rust

So, I'm trying to do an exhaustive search of a hash. The hash itself is not important here. As I want to use all the processing power of my CPU, I'm using Rayon to get a thread pool and lots of tasks. The search algorithm is the following:
let (tx, rx) = mpsc::channel();
let original_hash = String::from(original_hash);
rayon::spawn(move || {
    let mut i = 0;
    let mut iter = SenhaIterator::from(initial_pwd);
    while i < max_iteracoes {
        let pwd = iter.next().unwrap();
        let clone_tx = tx.clone();
        rayon::spawn(move || {
            let hash = calcula_hash(&pwd);
            clone_tx.send((pwd, hash)).unwrap();
        });
        i += 1;
    }
});
let mut last_pwd = None;
let bar = ProgressBar::new(max_iteracoes as u64);
while let Ok((pwd, hash)) = rx.recv() {
    last_pwd = Some(pwd);
    if hash == original_hash {
        bar.finish();
        return last_pwd.map_or(ResultadoSenha::SenhaNaoEncontrada(None), |s| {
            ResultadoSenha::SenhaEncontrada(s)
        });
    }
    bar.inc(1);
}
bar.finish();
ResultadoSenha::SenhaNaoEncontrada(last_pwd)
Just a high-level explanation: as the tasks complete their work, they send a pair of (password, hash) to the main thread, which compares the hash with the original hash (the one I'm trying to find a password for). If they match, great, I return to main with an enum value that indicates success, along with the password that produces the original hash. After all iterations end, I return to main with an enum value that indicates that no hash was found, but with the last password, so I can retry from this point in some future run.
I'm trying to use Indicatif to show a progress bar, so I can get a glimpse of the progress.
But my problem is that the program's memory usage keeps growing without a clear reason why. If I run it with, let's say, 1 billion iterations, it slowly consumes more memory until it fills all available system memory.
But when I comment out the line bar.inc(1);, the program behaves as expected, with normal memory usage.
I've created a test program, with Rayon and Indicatif, but without the hash calculation and it works correctly, no memory misbehavior.
That makes me think that I'm doing something wrong with memory management in my code, but I can't see anything obvious.
I found a solution, but I'm still not sure why it solves the original problem.
What I did to solve it was to move the progress code into the first spawn closure. Look at the ProgressBar::new and bar.inc lines below:
let (tx, rx) = mpsc::channel();
let original_hash = String::from(original_hash);
rayon::spawn(move || {
    let bar = ProgressBar::new(max_iteracoes as u64);
    let mut i = 0;
    let mut iter = SenhaIterator::from(initial_pwd);
    while i < max_iteracoes {
        let pwd = iter.next().unwrap();
        let clone_tx = tx.clone();
        rayon::spawn(move || {
            let hash = calcula_hash(&pwd);
            clone_tx.send((pwd, hash)).unwrap();
        });
        i += 1;
        // inc takes the increment as an argument
        bar.inc(1);
    }
    bar.finish();
});
let mut latest_pwd = None;
while let Ok((pwd, hash)) = rx.recv() {
    latest_pwd = Some(pwd);
    if hash == original_hash {
        return latest_pwd.map_or(PasswordOutcome::PasswordNotFound(None), |s| {
            PasswordOutcome::PasswordFound(s)
        })
    }
}
PasswordOutcome::PasswordNotFound(latest_pwd)
That first spawn closure has the role of fetching the next password to try and pass it over to a worker task, which calculates the corresponding hash and sends the pair (password, hash) to the main thread. The main thread will wait for the pairs from the rx channel and compare with the expected hash.
What I still don't understand is why tracking the progress on the outer thread leaks memory. I couldn't identify what is actually leaking. But it's working now and I'm happy with the result.

Using par map to increase performance

The code below runs a comparison of users and writes the results to a file. I've removed some code to keep it as concise as possible, but speed is also an issue in this code:
import scala.collection.JavaConversions._
object writedata {
  def getDistance(str1: String, str2: String) = {
    val zipped = str1.zip(str2)
    val numberOfEqualSequences = zipped.count(_ == ('1', '1')) * 2
    val p = zipped.count(_ == ('1', '1')).toFloat * 2
    val q = zipped.count(_ == ('1', '0')).toFloat * 2
    val r = zipped.count(_ == ('0', '1')).toFloat * 2
    val s = zipped.count(_ == ('0', '0')).toFloat * 2
    (q + r) / (p + q + r)
  } //> getDistance: (str1: String, str2: String)Float
  case class UserObj(id: String, nCoordinate: String)
  val userList = new java.util.ArrayList[UserObj] //> userList : java.util.ArrayList[writedata.UserObj] = []
  for (a <- 1 to 100) {
    userList.add(new UserObj("2", "101010"))
  }
  def using[A <: { def close(): Unit }, B](param: A)(f: A => B): B =
    try { f(param) } finally { param.close() } //> using: [A <: AnyRef{def close(): Unit}, B](param: A)(f: A => B)B
  def appendToFile(fileName: String, textData: String) =
    using(new java.io.FileWriter(fileName, true)) {
      fileWriter =>
        using(new java.io.PrintWriter(fileWriter)) {
          printWriter => printWriter.println(textData)
        }
    } //> appendToFile: (fileName: String, textData: String)Unit
  var counter = 0; //> counter : Int = 0
  for (xUser <- userList.par) {
    userList.par.map(yUser => {
      if (!xUser.id.isEmpty && !yUser.id.isEmpty)
        synchronized {
          appendToFile("c:\\data-files\\test.txt", getDistance(xUser.nCoordinate , yUser.nCoordinate).toString)
        }
    })
  }
}
The above code was previously an imperative solution, so the .par functionality was within an inner and outer loop. I'm attempting to convert it to a more functional implementation while also taking advantage of Scala's parallel collections framework.
In this example the data set size is 100, but in the code I'm working on the size is 8000, which translates to 64,000,000 comparisons. I'm using a synchronized block so that multiple threads don't write to the same file at the same time. A performance improvement I'm considering is populating a separate collection within the inner loop ( userList.par.map(yUser => { ) and then writing that collection out to a separate file.
Are there other methods I can use to improve performance, so that I can handle a List that contains 8000 items instead of the 100 in the example above?
I'm not sure if you removed too much code for clarity, but from what I can see, there is absolutely nothing that can run in parallel since the only thing you are doing is writing to a file.
EDIT:
One thing that you should do is to move the getDistance(...) computation before the synchronized call to appendToFile, otherwise your parallelized code ends up being sequential.
Instead of calling a synchronized appendToFile, I would call appendToFile in a non-synchronized way, but have each call to that method add the new line to some synchronized queue. Then I would have another thread that flushes that queue to disk periodically. But then you would also need to add something to make sure that the queue is also flushed when all computations are done. So that could get complicated...
Alternatively, you could also keep your code and simply drop the synchronization around the call to appendToFile. It seems that println itself is synchronized. However, that would be risky since println is not officially synchronized and it could change in future versions.
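To make the first suggestion concrete, here is a minimal sketch that computes all distances in parallel without any lock and only then writes them out through a single writer. It is not the asker's code: the names DistanceBatch, allDistances and writeAll are made up for illustration, distance is a condensed version of getDistance (the factor of 2 cancels out), and it uses Future from the standard library instead of .par so that it does not depend on the separate parallel collections module.

```scala
import java.io.{BufferedWriter, FileWriter}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical helper object, not part of the question's code.
object DistanceBatch {
  // Same formula as getDistance in the question, (q + r) / (p + q + r),
  // computed in a single pass over the zipped strings.
  def distance(a: String, b: String): Float = {
    var p = 0; var q = 0; var r = 0
    a.zip(b).foreach {
      case ('1', '1') => p += 1
      case ('1', '0') => q += 1
      case ('0', '1') => r += 1
      case _          => ()
    }
    (q + r).toFloat / (p + q + r)
  }

  // One Future per outer element; each computes a whole row without locking.
  def allDistances(coords: Vector[String]): Vector[Float] = {
    val rows = Future.traverse(coords) { x =>
      Future(coords.map(y => distance(x, y)))
    }
    Await.result(rows, Duration.Inf).flatten
  }

  // Single writer: open the file once and append every value sequentially.
  def writeAll(fileName: String, values: Seq[Float]): Unit = {
    val out = new BufferedWriter(new FileWriter(fileName, true))
    try values.foreach { v => out.write(v.toString); out.newLine() }
    finally out.close()
  }
}
```

With 8000 users, holding all 64,000,000 results in memory at once is probably too much, so in practice you would likely write one row at a time as each Future completes rather than collecting everything first.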

How to use scalatest to check that certain events happen in a particular order?

I need to test that my library processes certain (user-defined) events in the proper order. Currently, I'm doing something very simple. I create a buffer and let each of the events append a different value to it:
val buf = new collection.mutable.ArrayBuffer[Int];
val ev1 = () => { buf += 1; }
val ev2 = () => { buf += 2; }
//
// ... library runs the events ...
//
// check that ev2 occurred before ev1
buf should be (ArrayBuffer(2, 1))
Is there a better and clearer way?
Update: Meanwhile I created a tiny toolkit that helps me with the tests. The main class Event allows wrapping computations and functions, and registers when a computation occurred with respect to other events. I have only a little insight into scalatest, so I don't know how to integrate it better. If you know, please suggest.
I know it's almost the same, but you could make your code a bit cleaner. If you need to test event order multiple times you could define a trait like this:
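import scala.collection.mutable.ArrayBuffer
trait EventOrderTester {
  // values are appended in the order the events actually fire
  val buf = ArrayBuffer.empty[Int]
  def ev(order: Int): () => Unit = () => buf += order
  // the expected order is simply the recorded values in ascending order
  lazy val expected = buf.sorted
}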
Then you can define tests like this:
"my test" in new EventOrderTester {
  x.addListener1(ev(2))
  x.addListener2(ev(1))
  //
  // ... library runs the events ...
  //
  // check that listener2 occurred before listener1
  buf should be(expected)
}
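If numeric tags feel too cryptic, a variation on the same idea is to record event names instead. This is only a sketch with no scalatest dependency; EventOrder, Recorder and happenedBefore are hypothetical names, not part of scalatest or of the toolkit mentioned in the question.

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper, not part of scalatest.
object EventOrder {
  // Records event names in the order their callbacks are invoked.
  final class Recorder {
    private val seen = ArrayBuffer.empty[String]

    // Returns a callback that logs `name` each time it runs.
    def event(name: String): () => Unit = () => { seen += name; () }

    // Snapshot of everything recorded so far, oldest first.
    def order: List[String] = seen.toList

    // True if `first` was recorded before `second`; false if either never ran.
    def happenedBefore(first: String, second: String): Boolean = {
      val i = seen.indexOf(first)
      val j = seen.indexOf(second)
      i >= 0 && j >= 0 && i < j
    }
  }
}
```

In a test you would create one Recorder, hand rec.event("ev1") and rec.event("ev2") to the library, and afterwards compare rec.order with the expected list using whatever matcher you already use, for example rec.order should be (List("ev2", "ev1")).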

Accessing position information in a scala combinatorparser kills performance

I wrote a new combinator for my parser in Scala.
It's a variation of the ^^ combinator that passes position information on.
But accessing the position information of the input element is really costly.
In my case, parsing a big example takes around 3 seconds without position information and over 30 seconds with it.
I wrote a runnable example in which the runtime is about 50% longer when accessing the position.
Why is that? How can I get a better runtime?
Example:
import scala.util.parsing.combinator.RegexParsers
import scala.util.parsing.combinator.Parsers
import scala.util.matching.Regex
import scala.language.implicitConversions
object FooParser extends RegexParsers with Parsers {
  var withPosInfo = false
  def b: Parser[String] = regexB("""[a-z]+""".r) ^^# { case (b, x) => b + " ::" + x.toString }
  def regexB(p: Regex): BParser[String] = new BParser(regex(p))
  class BParser[T](p: Parser[T]) {
    def ^^#[U](f: ((Int, Int), T) => U): Parser[U] = Parser { in =>
      val source = in.source
      val offset = in.offset
      val start = handleWhiteSpace(source, offset)
      val inwo = in.drop(start - offset)
      p(inwo) match {
        case Success(t, in1) =>
          {
            var a = 3
            var b = 4
            if(withPosInfo)
            { // takes a lot of time
              a = inwo.pos.line
              b = inwo.pos.column
            }
            Success(f((a, b), t), in1)
          }
        case ns: NoSuccess => ns
      }
    }
  }
  def main(args: Array[String]) = {
    val r = "foo"*50000000
    var now = System.nanoTime
    parseAll(b, r)
    var us = (System.nanoTime - now) / 1000
    println("without: %d us".format(us))
    withPosInfo = true
    now = System.nanoTime
    parseAll(b, r)
    us = (System.nanoTime - now) / 1000
    println("with : %d us".format(us))
  }
}
Output:
without: 2952496 us
with : 4591070 us
Unfortunately, I don't think you can use the same approach. The problem is that line numbers end up implemented by scala.util.parsing.input.OffsetPosition which builds a list of every line break every time it is created. So if it ends up with string input it will parse the entire thing on every call to pos (twice in your example). See the code for CharSequenceReader and OffsetPosition for more details.
There is one quick thing you can do to speed this up:
val ip = inwo.pos
a = ip.line
b = ip.column
to at least avoid creating pos twice. But that still leaves you with a lot of redundant work. I'm afraid to really solve the problem you'll have to build the index as in OffsetPosition yourself, just once, and then keep referring to it.
You could also file a bug report / make an enhancement request. This is not a very good way to implement the feature.
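Building the index yourself could look roughly like the sketch below. It assumes '\n' is the only line break and uses made-up names (LineIndex, lineStarts, lineAndColumn); it is not the library's implementation. The idea is to scan the input once for line starts and then answer each position query with a binary search.

```scala
// Hypothetical helper, not part of scala.util.parsing.
object LineIndex {
  // Offsets at which each line starts; computed once per input.
  def lineStarts(source: CharSequence): Array[Int] = {
    val starts = Array.newBuilder[Int]
    starts += 0
    var i = 0
    while (i < source.length) {
      if (source.charAt(i) == '\n') starts += i + 1
      i += 1
    }
    starts.result()
  }

  // 1-based (line, column) for an offset, found by binary search for the
  // last line start that is <= offset.
  def lineAndColumn(starts: Array[Int], offset: Int): (Int, Int) = {
    var lo = 0
    var hi = starts.length - 1
    while (lo < hi) {
      val mid = (lo + hi + 1) / 2
      if (starts(mid) <= offset) lo = mid else hi = mid - 1
    }
    (lo + 1, offset - starts(lo) + 1)
  }
}
```

In the combinator you would compute lineStarts(in.source) once before parsing, keep it in a val, and call lineAndColumn(starts, inwo.offset) instead of inwo.pos.line and inwo.pos.column.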

What is the performance impact of Scala implicit type conversions?

In Scala, is there a significant CPU or memory impact to using implicit type conversions to augment a class's functionality vs. other possible implementation choices?
For example, consider a silly String manipulation function. This implementation uses string concatenation:
object Funky {
  def main(args: Array[String]) {
    args foreach(arg => println("Funky " + arg))
  }
}
This implementation hides the concatenation behind a member method by using an implicit type conversion:
class FunkyString(str: String) {
  def funkify() = "Funky " + str
}
object ImplicitFunky {
  implicit def asFunkyString(str: String) = new FunkyString(str)
  def main(args: Array[String]) {
    args foreach(arg => println(arg.funkify()))
  }
}
Both do the same thing:
scala> Funky.main(Array("Cold Medina", "Town", "Drummer"))
Funky Cold Medina
Funky Town
Funky Drummer
scala> ImplicitFunky.main(Array("Cold Medina", "Town", "Drummer"))
Funky Cold Medina
Funky Town
Funky Drummer
Is there any performance difference? A few specific considerations:
Does Scala inline the implicit calls to the asFunkyString method?
Does Scala actually create a new wrapper FunkyString object for each arg, or can it optimize away the extra object allocations?
Suppose FunkyString had 3 different methods (funkify1, funkify2, and funkify3), and the body of foreach called each one in succession:
println(arg.funkify1())
println(arg.funkify2())
println(arg.funkify3())
Would Scala repeat the conversion 3 times, or would it optimize away the redundant conversions and just do it once for each loop iteration?
Suppose instead that I explicitly capture the conversion in another variable, like this:
val fs = asFunkyString(arg)
println(fs.funkify1())
println(fs.funkify2())
println(fs.funkify3())
Does that change the situation?
In practical terms, is broad usage of implicit conversions a potential performance issue, or is it typically harmless?
I tried to set up a microbenchmark using the excellent Scala-Benchmark-Template.
It is very difficult to write a meaningful benchmark (one that is not optimized away by the JIT) which tests just the implicit conversions, so I had to add a bit of overhead.
Here is the code:
class FunkyBench extends SimpleScalaBenchmark {
  val N = 10000
  def timeDirect( reps: Int ) = repeat(reps) {
    var strs = List[String]()
    var s = "a"
    for( i <- 0 until N ) {
      s += "a"
      strs ::= "Funky " + s
    }
    strs
  }
  def timeImplicit( reps: Int ) = repeat(reps) {
    import Funky._
    var strs = List[String]()
    var s = "a"
    for( i <- 0 until N ) {
      s += "a"
      strs ::= s.funkify
    }
    strs
  }
}
And here are the results:
[info] benchmark ms linear runtime
[info] Direct 308 =============================
[info] Implicit 309 ==============================
My conclusion: in any non trivial piece of code, the impact of implicit conversions (object creation) is not measurable.
EDIT: I used scala 2.9.0 and java 1.6.0_24 (in server mode)
The JVM can optimize away the extra object allocations if it determines that doing so is worthwhile.
This is important, because if you just inline things you end up with bigger methods, which might cause performance problems with the cache or even reduce the chance of the JVM applying other optimizations.
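For what it's worth, since Scala 2.10 (so later than the 2.9.0 used in the benchmark above) the wrapper can also be declared as an implicit value class, which lets the compiler avoid the allocation in most cases instead of relying on the JIT. A minimal sketch, with FunkySyntax, FunkyOps and FunkyDemo as made-up names:

```scala
// Hypothetical names; a variation on the question's FunkyString.
object FunkySyntax {
  // Implicit value class: extending AnyVal lets the compiler call funkify as
  // a static extension method in most cases, without allocating a wrapper.
  implicit class FunkyOps(val str: String) extends AnyVal {
    def funkify: String = "Funky " + str
  }
}

object FunkyDemo {
  import FunkySyntax._

  // Uses the extension method exactly like the implicit conversion version.
  def funkifyAll(args: Seq[String]): Seq[String] = args.map(_.funkify)
}
```

Call sites look the same as in the question (arg.funkify). A wrapper instance can still be allocated in some situations, for example when the value is used as a more general type, so this is a hint to the compiler rather than a guarantee.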

Resources