Which Scala features have poor performance - performance

I was wondering lately: as Scala runs on the JVM, and the latter is optimized for some types of operations, are there features whose implementation is really inefficient on the JVM and whose use should therefore be discouraged? Could you also explain why they are inefficient?
The first candidate would be functional programming features: as far as I know, functions are special classes with an apply method, which obviously creates additional overhead compared to languages where functions are just blocks of code.

Performance tuning is a deep and complex issue, but three things come immediately to mind.
Scala collections are good for expressive power, but not for performance.
Consider:
(1 to 20).map(x => x*x).sum
val a = new Array[Int](20)
var i = 0
while (i < 20) { a(i) = i+1; i += 1 } // (1 to 20)
i = 0
while (i < 20) { a(i) = a(i)*a(i); i += 1 } // map(x => x*x)
var s = 0
i = 0
while (i < 20) { s += a(i); i += 1 } // sum
s
The first is amazingly more compact. The second is 16x faster. Math on integers is really fast; boxing and unboxing is not. The generic collections code is, well, generic, and relies on boxing.
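The three passes above can also be fused into a single loop, so that no intermediate array is needed. The following is only a minimal sketch; the helper name sumOfSquares is made up for illustration, and no timing claim is attached to it:

```scala
// Hypothetical helper: sum of the squares of 1..n in one pass over primitive Ints.
def sumOfSquares(n: Int): Int = {
  var s = 0
  var i = 1
  while (i <= n) {
    s += i * i // stays an Int throughout, so nothing is boxed
    i += 1
  }
  s
}

// Both expressions compute the same value; only the amount of machinery differs.
val viaCollections = (1 to 20).map(x => x * x).sum
val viaLoop = sumOfSquares(20)
```

As with the version above, this trades readability for speed and is only worth it in a measured hot spot.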
Function2 is only specialized on Int, Long, and Double arguments.
Any other operation on primitives will require boxing. Beware!
Suppose you want to have a function where you can toggle a capability--maybe you want to capitalize letters or not. You try:
def doOdd(a: Array[Char], f: (Char, Boolean) => Char) = {
var i = 0
while (i<a.length) { a(i) = f(a(i), (i&1)==1); i += 1 }
a
}
And then you try it out:
val text = "The quick brown fox jumps over the lazy dog".toArray
val f = (c: Char, b: Boolean) => if (b) c.toUpper else c.toLower
scala> println( doOdd(text, f).mkString )
tHe qUiCk bRoWn fOx jUmPs oVeR ThE LaZy dOg
Okay, great! Except what if we
trait Func_CB_C { def apply(c: Char, b: Boolean): Char }
val g = new Func_CB_C {
def apply(c: Char, b: Boolean) = if (b) c.toUpper else c.toLower
}
def doOdd2(a: Array[Char], f: Func_CB_C) = {
var i = 0
while (i<a.length) { a(i) = f(a(i), (i&1)==1); i += 1 }
a
}
instead? Suddenly it's 3x faster. But if it's (Int, Int) => Int, (or any other permutation of Int/Long/Double arguments and Unit/Boolean/Int/Long/Float/Double return values), rolling your own is unnecessary--it's specialized and works at maximum speed.
Just because you can parallelize easily doesn't mean it's a good idea.
Scala's parallel collections will just try to run your code in parallel. It's up to you to make sure there's enough work so that running in parallel is a smart thing to do. There's a lot of overhead in setting up threads and collecting results. Take, for example,
val v = (1 to 1000).to[Vector]
v.map(x => x*(x+1))
versus
val u = (1 to 1000).to[Vector].par
u.map(x => x*(x+1))
The second map is faster, right, because it's parallel?
Hardly! It's 10x slower because of overhead (on my machine; results can vary substantially)
Summary
These are just a few of very many issues that you'll normally never have to worry about except in the most performance-critical parts of your code. There are oodles more, which eventually you'll encounter, but as I mentioned in my comment, it would take a book to cover a decent fraction of them. Note that there are oodles of performance issues in any language, and optimization is often tricky. Save your effort for where it matters!

Related

Trial division for primes with immutable collections in Scala

I am trying to learn Scala and the functional programming mindset by rewriting basic exercises. Currently I have trouble with the naive "trial division" approach to generating primes.
The trouble described below is that I could not rewrite a well-known algorithm in functional style while preserving efficiency, because I have no suitable immutable data structure: something like a List, but with fast operations not only on the head but also on the very end.
I started by writing Java code which, for every odd number, tests its divisibility by the primes already found (limited by the square root of the value being tested) and adds it to the end of the list if no divisor was found.
http://ideone.com/QE8U0I
List<Integer> primes = new ArrayList<>();
primes.add(2);
int cur = 3;
while (primes.size() < 100000) {
for (Integer x : primes) {
if (x * x > cur) {
primes.add(cur);
break;
}
if (cur % x == 0) {
break;
}
}
cur += 2;
}
Now I tried to rewrite it in a "functional way". There was no problem with using recursion instead of loops, but I got stuck with immutable collections. The core idea is as follows:
http://ideone.com/4DQ6mi
import scala.annotation.tailrec
def primes(n: Int) = {
@tailrec
def divisibleByAny(x: Int, list: List[Int]): Boolean = {
if (list.isEmpty) false else {
val h = list.head
h * h <= x && (x % h == 0 || divisibleByAny(x, list.tail))
}
}
@tailrec
def morePrimes(from: Int, prev: List[Int]): List[Int] = {
if (prev.size == n) prev else
morePrimes(from + 2, if (divisibleByAny(from, prev)) prev else prev :+ from)
}
morePrimes(3, List(2))
}
But it is slow; if I understand correctly, this is because adding to the end of an immutable List requires creating a new copy of the whole list.
I searched the documentation for a more suitable data structure and tried to substitute the List with an immutable Queue, for it is said:
Adding items to the queue always has cost O(1) ... Removing an item is on average O(1).
But it is even slower:
http://ideone.com/v8BsuQ
import scala.annotation.tailrec
import scala.collection.immutable.Queue
def primes(n: Int) = {
@tailrec
def divisibleByAny(x: Int, list: Queue[Int]): Boolean = {
if (list.isEmpty) false else {
val (h, t) = list.dequeue
h * h <= x && (x % h == 0 || divisibleByAny(x, t))
}
}
@tailrec
def morePrimes(from: Int, prev: Queue[Int]): Queue[Int] = {
if (prev.size == n) prev else
morePrimes(from + 2, if (divisibleByAny(from, prev)) prev else prev.enqueue(from))
}
morePrimes(3, Queue(2))
}
What is going wrong or am I missing something?
P.S. I believe there are other algorithms for generating primes which are more suitable for functional style. I think I've seen some paper. But now I'm interested in this one, or more precisely in existence of suitable data structure.
According to http://docs.scala-lang.org/overviews/collections/performance-characteristics.html Vectors have an amortised constant cost for appending, prepending and seeking. Indeed, using vectors instead of lists in your solution is much faster
import scala.annotation.tailrec
def primes(n: Int) = {
@tailrec
def divisibleByAny(x: Int, list: Vector[Int]): Boolean = {
if (list.isEmpty) false else {
val (h +: t) = list
h * h <= x && (x % h == 0 || divisibleByAny(x, t))
}
}
@tailrec
def morePrimes(from: Int, prev: Vector[Int]): Vector[Int] = {
if (prev.length == n) prev else
morePrimes(from + 2, if (divisibleByAny(from, prev)) prev else prev :+ from)
}
morePrimes(3, Vector(2))
}
http://ideone.com/x3k4A3
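The Vector version above still takes the vector apart with +: on every step. As a hedged variation (not part of the original answer), the divisors can instead be walked by index, so the only new Vector created is the one produced by :+ when a prime is found. The name primesVector and the extra index parameter are assumptions made for this sketch:

```scala
import scala.annotation.tailrec

// Sketch: first n primes by trial division, scanning the Vector by index.
def primesVector(n: Int): Vector[Int] = {
  @tailrec
  def divisibleByAny(x: Int, ps: Vector[Int], i: Int): Boolean =
    if (i >= ps.length) false
    else {
      val p = ps(i)              // effectively constant-time indexed read
      if (p * p > x) false       // past sqrt(x): no divisor can follow
      else if (x % p == 0) true
      else divisibleByAny(x, ps, i + 1)
    }

  @tailrec
  def morePrimes(from: Int, prev: Vector[Int]): Vector[Int] =
    if (prev.length == n) prev
    else morePrimes(from + 2, if (divisibleByAny(from, prev, 0)) prev else prev :+ from)

  morePrimes(3, Vector(2))
}
```

Like the original, it assumes n >= 1. Whether it is actually faster than the +: version is something to measure rather than assume.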
I think you have 2 main options
Use a Vector - which is better than a list for appending. It is a Bitmapped Trie data structure (http://en.wikipedia.org/wiki/Trie). It’s “effectively” O(1) for appending to (i.e. O(1) on average)
Or...possibly the answer you're not looking for
Use a mutable data structure like ListBuffer. Immutability is great to strive for, and immutable collections should be your go-to, but sometimes, for efficiency reasons, you may use mutable structures. The key is to make sure the mutable state does not "leak out" of your classes. If you look at the List.scala implementation, you'll see ListBuffer used a lot internally; however, it is converted back to a List just before it leaves the class. If it's good enough for the core Scala libraries, it's probably OK for you to use in the exceptional cases that warrant it.
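A minimal sketch of that "mutable inside, immutable outside" idea, applied to the trial-division problem. The function name firstPrimes is hypothetical; the ListBuffer never escapes the method, so callers only ever see an immutable List:

```scala
import scala.collection.mutable.ListBuffer

// Sketch: the buffer is a private implementation detail of this method.
def firstPrimes(n: Int): List[Int] = {
  val buf = ListBuffer.empty[Int]
  var cur = 2
  while (buf.length < n) {
    // primes are appended in ascending order, so we can stop at sqrt(cur)
    val isPrime = buf.iterator.takeWhile(p => p * p <= cur).forall(cur % _ != 0)
    if (isPrime) buf += cur // constant-time append
    cur += 1
  }
  buf.toList // hand out an immutable result
}
```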
Besides using Vector, also consider using higher-order functions instead of recursion. That's also a completely valid functional style. On my machine the following implementation of divisibleByAny is about 8x faster than @Pyetras' tailrec implementation when running primes(1000000):
def divisibleByAny(x: Int, list: Vector[Int]): Boolean =
list.view.takeWhile(el => el * el <= x).exists(x % _ == 0)

Find prime numbers using Scala. Help me to improve

I wrote this code to find the prime numbers less than the given number i in Scala.
def findPrime(i : Int) : List[Int] = i match {
case 2 => List(2)
case _ => {
val primeList = findPrime(i-1)
if(isPrime(i, primeList)) i :: primeList else primeList
}
}
def isPrime(num : Int, prePrimes : List[Int]) : Boolean = prePrimes.forall(num % _ != 0)
But I have a feeling that the findPrime function, especially this part:
case _ => {
val primeList = findPrime(i-1)
if(isPrime(i, primeList)) i :: primeList else primeList
}
is not quite in the functional style.
I am still learning functional programming. Can anyone please help me improve this code to make it more functional.
Many thanks.
Here's a functional implementation of the Sieve of Eratosthenes, as presented in Odersky's "Functional Programming Principles in Scala" Coursera course :
// Sieving integral numbers
def sieve(s: Stream[Int]): Stream[Int] = {
s.head #:: sieve(s.tail.filter(_ % s.head != 0))
}
// All primes as a lazy sequence
val primes = sieve(Stream.from(2))
// Dumping the first five primes
print(primes.take(5).toList) // List(2, 3, 5, 7, 11)
The style looks fine to me. Although the Sieve of Eratosthenes is a very efficient way to find prime numbers, your approach works well too, since you are only testing for division against known primes. You need to watch out however--your recursive function is not tail recursive. A tail recursive function does not modify the result of the recursive call--in your example you prepend to the result of the recursive call. This means that you will have a long call stack and so findPrime will not work for large i. Here is a tail-recursive solution.
def primesUnder(n: Int): List[Int] = {
require(n >= 2)
def rec(i: Int, primes: List[Int]): List[Int] = {
if (i >= n) primes
else if (prime(i, primes)) rec(i + 1, i :: primes)
else rec(i + 1, primes)
}
rec(2, List()).reverse
}
def prime(num: Int, factors: List[Int]): Boolean = factors.forall(num % _ != 0)
This solution isn't prettier--it's more of a detail to get your solution to work for large arguments. Since the list is built up backwards to take advantage of fast prepends, the list needs to be reversed. As an alternative, you could use an Array, Vector or a ListBuffer to append the results. With the Array, however, you would need to estimate how much memory to allocate for it. Fortunately we know that pi(n) is about equal to n / ln(n) so you can choose a reasonable size. Array and ListBuffer are also mutable data types, which goes against your desire for functional style.
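To make the Array sizing concrete, here is a rough sketch. It assumes the classical upper bound pi(n) < 1.25506 * n / ln(n) for n > 1, which is not stated in the answer itself, and the names primeCountUpperBound and primesUnderArray are invented for this illustration:

```scala
// Assumption: pi(n) < 1.25506 * n / ln(n) for n > 1, used here as a safe capacity.
def primeCountUpperBound(n: Int): Int =
  if (n < 2) 0 else math.ceil(1.25506 * n / math.log(n)).toInt

// Sketch: primes below n, appended to a pre-sized Array in ascending order.
def primesUnderArray(n: Int): Array[Int] = {
  val out = new Array[Int](primeCountUpperBound(n))
  var count = 0
  var i = 2
  while (i < n) {
    var isPrime = true
    var j = 0
    // the stored primes are ascending, so stop once p * p exceeds i
    while (isPrime && j < count && out(j) * out(j) <= i) {
      if (i % out(j) == 0) isPrime = false
      j += 1
    }
    if (isPrime) { out(count) = i; count += 1 }
    i += 1
  }
  out.take(count) // drop the unused tail of the over-allocated array
}
```

Because the primes are stored in ascending order, this layout also makes the square-root cutoff straightforward.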
Update: to get good performance out of the Sieve of Eratosthenes I think you'll need to store data in a native array, which also goes against your desire for style in functional programming. There might be a creative functional implementation though!
Update: oops! Missed it! This approach works well too if you only divide by primes less than the square root of the number you are testing! I missed this, and unfortunately it's not easy to adjust my solution to do this because I'm storing the primes backwards.
Update: here's a very non-functional solution that at least only checks up to the square root.
import scala.collection.mutable.ListBuffer
def primesUnder(n: Int): List[Int] = {
require(n >= 2)
val primes = ListBuffer(2)
for (i <- 3 to n) {
if (prime(i, primes.iterator)) {
primes += i
}
}
primes.toList
}
// factors must be in sorted order
def prime(num: Int, factors: Iterator[Int]): Boolean =
factors.takeWhile(_ <= math.sqrt(num).toInt) forall(num % _ != 0)
Or I could use Vectors with my original approach. Vectors are probably not the best solution because their append is not the fastest O(1); it is only amortized O(1).
As schmmd mentions, you want it to be tail recursive, and you also want it to be lazy. Fortunately there is a perfect data-structure for this: Stream.
This is a very efficient prime calculator implemented as a Stream, with a few optimisations:
object Prime {
def is(i: Long): Boolean =
if (i == 2) true
else if ((i & 1) == 0) false // efficient div by 2
else prime(i)
def primes: Stream[Long] = 2 #:: prime3
private val prime3: Stream[Long] = {
@annotation.tailrec
def nextPrime(i: Long): Long =
if (prime(i)) i else nextPrime(i + 2) // tail
def next(i: Long): Stream[Long] =
i #:: next(nextPrime(i + 2))
3 #:: next(5)
}
// assumes not even, check evenness before calling - perf note: must pass partially applied >= method
def prime(i: Long): Boolean =
prime3 takeWhile (math.sqrt(i).>= _) forall { i % _ != 0 }
}
Prime.is is the prime check predicate, and Prime.primes returns a Stream of all prime numbers. prime3 is where the Stream is computed, using the prime predicate to check for all prime divisors less than the square root of i.
import scala.collection.mutable
/**
* @return BitSet p such that p(x) is true iff x is prime
*/
def sieveOfEratosthenes(n: Int) = {
val isPrime = mutable.BitSet(2 to n: _*)
// only primes up to sqrt(n) need to be used for culling
for (p <- 2 to math.sqrt(n).toInt if isPrime(p)) {
isPrime --= p*p to n by p
}
isPrime.toImmutable
}
A sieve method is your best bet for small lists of numbers (up to 10-100 million or so).
see: Sieve of Eratosthenes
Even if you want to find much larger numbers, you can use the list you generate with this method as divisors for testing numbers up to n^2, where n is the limit of your list.
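The last point can be sketched as follows: sieve once up to some limit, then reuse those primes as trial divisors for any number up to limit squared. The names sievePrimes and isPrimeUpToSquare are hypothetical, and the sieve here uses a plain Boolean array rather than the BitSet above:

```scala
// Sketch: all primes <= n, via a simple Boolean-array sieve.
def sievePrimes(n: Int): Vector[Int] = {
  require(n >= 2, "n must be at least 2")
  val composite = new Array[Boolean](n + 1)
  var p = 2
  while (p.toLong * p <= n) {
    if (!composite(p)) {
      var m = p * p // smaller multiples were already culled by smaller primes
      while (m <= n) { composite(m) = true; m += p }
    }
    p += 1
  }
  (2 to n).filterNot(i => composite(i)).toVector
}

// Sketch: primality of x <= limit^2 by trial division against the sieved primes.
def isPrimeUpToSquare(x: Long, limit: Int, basePrimes: Vector[Int]): Boolean = {
  require(x <= limit.toLong * limit, "x must not exceed limit^2")
  x >= 2 && basePrimes.iterator.takeWhile(p => p.toLong * p <= x).forall(p => x % p != 0)
}

val base = sievePrimes(100)                          // the 25 primes up to 100
val bigPrime = isPrimeUpToSquare(9973L, 100, base)   // 9973 <= 100^2
val bigComposite = isPrimeUpToSquare(9999L, 100, base)
```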
@mfa has mentioned using a Sieve of Eratosthenes (SoE) and @Luigi Plinge has mentioned that this should be done using functional code, so @netzwerg has posted a non-SoE version; here, I post an "almost" functional version of the SoE using completely immutable state except for the contents of a mutable BitSet (mutable rather than immutable for performance) that I posted as an answer to another question:
object SoE {
def makeSoE_Primes(top: Int): Iterator[Int] = {
val topndx = (top - 3) / 2
val nonprms = new scala.collection.mutable.BitSet(topndx + 1)
def cullp(i: Int) = {
import scala.annotation.tailrec; val p = i + i + 3
@tailrec def cull(c: Int): Unit = if (c <= topndx) { nonprms += c; cull(c + p) }
cull((p * p - 3) >>> 1)
}
(0 to (Math.sqrt(top).toInt - 3) >>> 1).filterNot { nonprms }.foreach { cullp }
Iterator.single(2) ++ (0 to topndx).filterNot { nonprms }.map { i: Int => i + i + 3 }
}
}
How about this.
def getPrimeUnder(n: Int) = {
require(n >= 2)
val ol = (3 to n by 2).toList // oddList
def pn(ol: List[Int], pl: List[Int]): List[Int] = ol match {
case Nil => pl
case _ if pl.exists(ol.head % _ == 0) => pn(ol.tail, pl)
case _ => pn(ol.tail, ol.head :: pl)
}
pn(ol, List(2)).reverse
}
It's pretty fast for me: on my Mac, getting all primes under 100k takes around 2.5 seconds.
A Scala FP approach
// returns the list of primes below `number`
def primes(number: Int): List[Int] =
(2 until number).filter(isPrime).toList
// checks if a number is prime; 0, 1 and negative numbers are not prime
def isPrime(number: Int): Boolean = number match {
case x if x < 2 => false
case x => (2 to math.sqrt(x).toInt).forall(y => x % y != 0)
}
import scala.collection.immutable
def primeNumber(range: Int): Unit = {
val primeNumbers: immutable.IndexedSeq[Int] =
for {
number <- 2 to range
// the upper bound must include sqrt(number), otherwise squares such as 4 and 9 slip through
if !(2 to Math.sqrt(number).toInt).exists(x => number % x == 0)
} yield number
for (prime <- primeNumbers) println(prime)
}
object Primes {
private lazy val notDivisibleBy2: Stream[Long] = 3L #:: notDivisibleBy2.map(_ + 2)
private lazy val notDivisibleBy2Or3: Stream[Long] = notDivisibleBy2
.grouped(3)
.map(_.slice(1, 3))
.flatten
.toStream
private lazy val notDivisibleBy2Or3Or5: Stream[Long] = notDivisibleBy2Or3
.grouped(10)
.map { g =>
g.slice(1, 7) ++ g.slice(8, 10)
}
.flatten
.toStream
lazy val primes: Stream[Long] = 2L #::
notDivisibleBy2.head #::
notDivisibleBy2Or3.head #::
notDivisibleBy2Or3Or5.filter { i =>
i < 49 || primes.takeWhile(_ <= Math.sqrt(i).toLong).forall(i % _ != 0)
}
def apply(n: Long): Stream[Long] = primes.takeWhile(_ <= n)
def getPrimeUnder(n: Long): Long = Primes(n).last
}

Why is my Scala tail-recursion faster than the while loop?

Here are two solutions to exercise 4.9 in Cay Horstmann's Scala for the Impatient: "Write a function lteqgt(values: Array[Int], v: Int) that returns a triple containing the counts of values less than v, equal to v, and greater than v." One uses tail recursion, the other uses a while loop. I thought that both would compile to similar bytecode but the while loop is slower than the tail recursion by a factor of almost 2. This suggests to me that my while method is badly written.
import scala.annotation.tailrec
import scala.util.Random
object PerformanceTest {
def main(args: Array[String]): Unit = {
val bigArray:Array[Int] = fillArray(new Array[Int](100000000))
println(time(lteqgt(bigArray, 25)))
println(time(lteqgt2(bigArray, 25)))
}
def time[T](block : => T):T = {
val start = System.nanoTime : Double
val result = block
val end = System.nanoTime : Double
println("Time = " + (end - start) / 1000000.0 + " millis")
result
}
@tailrec def fillArray(a:Array[Int], pos:Int=0):Array[Int] = {
if (pos == a.length)
a
else {
a(pos) = Random.nextInt(50)
fillArray(a, pos+1)
}
}
@tailrec def lteqgt(values: Array[Int], v:Int, lt:Int=0, eq:Int=0, gt:Int=0, pos:Int=0):(Int, Int, Int) = {
if (pos == values.length)
(lt, eq, gt)
else
lteqgt(values, v, lt + (if (values(pos) < v) 1 else 0), eq + (if (values(pos) == v) 1 else 0), gt + (if (values(pos) > v) 1 else 0), pos+1)
}
def lteqgt2(values:Array[Int], v:Int):(Int, Int, Int) = {
var lt = 0
var eq = 0
var gt = 0
var pos = 0
val limit = values.length
while (pos < limit) {
if (values(pos) > v)
gt += 1
else if (values(pos) < v)
lt += 1
else
eq += 1
pos += 1
}
(lt, eq, gt)
}
}
Adjust the size of bigArray according to your heap size. Here is some sample output:
Time = 245.110899 millis
(50004367,2003090,47992543)
Time = 465.836894 millis
(50004367,2003090,47992543)
Why is the while method so much slower than the tailrec? Naively the tailrec version looks to be at a slight disadvantage, as it must always perform 3 "if" checks for every iteration, whereas the while version will often only perform 1 or 2 tests due to the else construct. (NB reversing the order I perform the two methods does not affect the outcome).
Test results (after reducing array size to 20000000)
Under Java 1.6.22 I get 151 and 122 ms for tail-recursion and while-loop respectively.
Under Java 1.7.0 I get 55 and 101 ms
So under Java 6 your while-loop is actually faster; both have improved in performance under Java 7, but the tail-recursive version has overtaken the loop.
Explanation
The performance difference is due to the fact that in your loop, you conditionally add 1 to the totals, while for recursion you always add either 1 or 0. So they are not equivalent. The equivalent while-loop to your recursive method is:
def lteqgt2(values:Array[Int], v:Int):(Int, Int, Int) = {
var lt = 0
var eq = 0
var gt = 0
var pos = 0
val limit = values.length
while (pos < limit) {
gt += (if (values(pos) > v) 1 else 0)
lt += (if (values(pos) < v) 1 else 0)
eq += (if (values(pos) == v) 1 else 0)
pos += 1
}
(lt, eq, gt)
}
and this gives exactly the same execution time as the recursive method (regardless of Java version).
Discussion
I'm not an expert on why the Java 7 VM (HotSpot) can optimize this better than your first version, but I'd guess it's because it's taking the same path through the code each time (rather than branching along the if / else if paths), so the bytecode can be inlined more efficiently.
But remember that this is not the case in Java 6. Why one while-loop outperforms the other is a question of JVM internals. Happily for the Scala programmer, the version produced from idiomatic tail-recursion is the faster one in the latest version of the JVM.
The difference could also be occurring at the processor level. See this question, which explains how code slows down if it contains unpredictable branching.
The two constructs are not identical. In particular, in the first case you don't need any jumps (on x86, you can use cmp and setle and add, instead of having to use cmp and jb and (if you don't jump) add). Not jumping is faster than jumping on pretty much every modern architecture.
So, if you have code that looks like
if (a < b) x += 1
where you may add or you may jump instead, vs.
x += (a < b)
(which only makes sense in C/C++ where 1 = true and 0 = false), the latter tends to be faster as it can be turned into more compact assembly code. In Scala/Java, you can't do this, but you can do
x += (if (a < b) 1 else 0)
which a smart JVM should recognize is the same as x += (a < b), which has a jump-free machine code translation, which is usually faster than jumping. An even smarter JVM would recognize that
if (a < b) x += 1
is the same yet again (because adding zero doesn't do anything).
C/C++ compilers routinely perform optimizations like this. Being unable to apply any of these optimizations was not a mark in the JIT compiler's favor; apparently it can as of 1.7, but only partially (i.e. it doesn't recognize that adding zero is the same as a conditional adding one, but it does at least convert x += if (a<b) 1 else 0 into fast machine code).
Now, none of this has anything to do with tail recursion or while loops per se. With tail recursion it's more natural to write the if (a < b) 1 else 0 form, but you can do either; and with while loops you can also do either. It just so happened that you picked one form for tail recursion and the other for the while loop, making it look like recursion vs. looping was the change instead of the two different ways to do the conditionals.
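To isolate the two conditional forms from the loop-versus-recursion question, here is a small sketch in which both variants are while loops. The function names are made up for illustration; the two functions always return the same count, and which one runs faster depends on the JVM and the data, so no timing is claimed:

```scala
// Form 1: conditionally add one (may be compiled to a compare-and-jump).
def countLessBranchy(values: Array[Int], v: Int): Int = {
  var count = 0
  var i = 0
  while (i < values.length) {
    if (values(i) < v) count += 1
    i += 1
  }
  count
}

// Form 2: always add either 0 or 1 (amenable to jump-free machine code).
def countLessBranchless(values: Array[Int], v: Int): Int = {
  var count = 0
  var i = 0
  while (i < values.length) {
    count += (if (values(i) < v) 1 else 0)
    i += 1
  }
  count
}

val sample = Array(5, 1, 9, 3, 7, 3)
val branchy = countLessBranchy(sample, 4)
val branchless = countLessBranchless(sample, 4)
```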

Scala PriorityQueue on Array[Int] performance issue with complex comparison function (caching is needed)

The problem involves the Scala PriorityQueue[Array[Int]] performance on large data set. The following operations are needed: enqueue, dequeue, and filter. Currently, my implementation is as follows:
For every element of type Array[Int], there is a complex evaluation function: (I'm not sure how to write it in a more efficient way, because it excludes the position 0)
def eval_fun(a : Array[Int]) =
if(a.size < 2) 3
else {
var ret = 0
var i = 1
while(i < a.size) {
if((a(i) & 0x3) == 1) ret += 1
else if((a(i) & 0x3) == 3) ret += 3
i += 1
}
ret / a.size
}
The ordering with a comparison function is based on the evaluation function: (Reversed, descendent order)
val arr_ord = new Ordering[Array[Int]] {
def compare(a : Array[Int], b : Array[Int]) = eval_fun(b) compare eval_fun(a) }
The PriorityQueue is defined as:
val pq = scala.collection.mutable.PriorityQueue.empty[Array[Int]](arr_ord)
Question:
(1) Is there a more elegant and efficient way to write such an evaluation function? I'm thinking of using fold, but fold cannot exclude the position 0.
(2) Is there a data structure to generate a priority queue with unique elements? Applying a filter operation after each enqueue operation is not efficient.
(3) Is there a cache method to reduce the evaluation computation? When adding a new element to the queue, every element may need to be evaluated by eval_fun again, which is not necessary if the evaluated value of every element can be cached. Also, I should mention that two distinct elements may have the same evaluated value.
(4) Is there a more efficient data structure without using generic type? Because if the size of elements reaches 10,000 and the size of size reaches 1,000, the performance is terribly slow.
Thank you.
(1) If you want maximum performance here, I would stick to the while loop, even if it is not terribly elegant. Otherwise, if you use a view on Array, you can easily drop the first element before going into the fold:
a.view.drop(1).foldLeft(0)( (sum, a) => sum + ((a & 0x03) match {
case 0x01 => 1
case 0x03 => 3
case _ => 0
})) / a.size
(2) You can maintain two structures, the priority queue, and a set. Both combined give you a sorted-set... So you could use collection.immutable.SortedSet, but there is no mutable variant in the standard library. Do you want equality based on the priority function, or the actual array contents? Because in the latter case, you won't get around comparing arrays element by element for each insertion, undoing the effect of caching the priority function value.
(3) Just put the calculated priority along with the array in the queue. I.e.
implicit val ord = Ordering.by[(Int, Array[Int]), Int](_._1)
val pq = new collection.mutable.PriorityQueue[(Int, Array[Int])]
pq += eval_fun(a) -> a
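Point (2), the queue-plus-set combination, could look roughly like the sketch below. UniquePQ is a hypothetical wrapper, not a standard library class. It relies on the element type having structural equals/hashCode, so for array data one would store something like Vector[Int] (or a tuple containing one) rather than a raw Array[Int]:

```scala
import scala.collection.mutable

// Sketch: a priority queue that silently rejects elements it already contains.
final class UniquePQ[A](implicit ord: Ordering[A]) {
  private val queue = mutable.PriorityQueue.empty[A]
  private val seen = mutable.HashSet.empty[A]

  /** Returns true if the element was added, false if it was a duplicate. */
  def enqueue(a: A): Boolean =
    if (seen.add(a)) { queue.enqueue(a); true } else false

  /** Removes and returns the highest-priority element. */
  def dequeue(): A = {
    val a = queue.dequeue()
    seen.remove(a) // allow the same value to be enqueued again later
    a
  }

  def isEmpty: Boolean = queue.isEmpty
  def size: Int = queue.size
}

// Usage: the second 3 is rejected, so only two elements are queued.
val q = new UniquePQ[Int]()
val firstAdd = q.enqueue(3)
val secondAdd = q.enqueue(3)
val thirdAdd = q.enqueue(5)
```

The set costs extra memory and a hash per insertion, which is the price for not filtering the queue after every enqueue.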
Well, you can use a tail recursive loop (generally these are more "idiomatic"):
def eval(a: Array[Int]): Int =
if (a.size < 2) 3
else {
@annotation.tailrec
def loop(ret: Int = 0, i: Int = 1): Int =
if (i >= a.size) ret / a.size
else {
val mod3 = (a(i) & 0x3)
if (mod3 == 1) loop(ret + 1, i + 1)
else if (mod3 == 3) loop(ret + 3, i + 1)
else loop(ret, i + 1)
}
loop()
}
Then you can use that to initialise a cached priority value:
case class PriorityArray(a: Array[Int]) {
lazy val priority = if (a.size < 2) 3 else {
@annotation.tailrec
def loop(ret: Int = 0, i: Int = 1): Int =
if (i >= a.size) ret / a.size
else {
val mod3 = (a(i) & 0x3)
if (mod3 == 2) loop(ret, i + 1)
else loop(ret + mod3, i + 1)
}
loop()
}
}
You may note also that I removed a redundant & op and have only the single conditional (for when it equals 2, rather than two checks for 1 && 3) – these should have some minimal effect.
There is not a huge difference from 0__'s proposal that just came through.
My answers:
If evaluation is critical, keep it as it is. You might get better performance with recursion (not sure why, but it happens), but you'll certainly get worse performance with pretty much any other approach.
No, there isn't, but you can come pretty close to it just modifying the dequeue operation:
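import scala.collection.mutable.PriorityQueue
def distinctDequeue[T](q: PriorityQueue[T]): T = {
val result = q.dequeue()
// drop further copies of the element just removed; stop if the queue runs empty
while (q.nonEmpty && q.head == result) q.dequeue()
result
}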
Otherwise, you'd have to keep a second data structure just to keep track of whether an element has been added or not. Either way, that equals sign is pretty heavy, but I have a suggestion to make it faster in the next item.
Note, however, that this requires that ties on the cost function get solved in some other way.
Like 0__ suggested, put the cost on the priority queue. But you can also keep a cache on the function if that would be helpful. I'd try something like this:
import scala.collection.mutable.WrappedArray
val evalMap = scala.collection.mutable.HashMap[WrappedArray[Int], Int]()
def eval_fun(a : Array[Int]) =
if(a.size < 2) 3
else evalMap.getOrElseUpdate(a, {
var ret = 0
var i = 1
while(i < a.size) {
if((a(i) & 0x3) == 1) ret += 1
else if((a(i) & 0x3) == 3) ret += 3
i += 1
}
ret / a.size
})
import scala.math.Ordering.Implicits._
val pq = new collection.mutable.PriorityQueue[(Int, WrappedArray[Int])]
pq += eval_fun(a) -> (a : WrappedArray[Int])
Note that I did not create a special Ordering -- I'm using the standard Ordering so that the WrappedArray will break the ties. There's little cost to wrap the Array, and you get it back with .array, but, on the other hand, you'll get the following:
Ties will be broken by comparing the array themselves. If there aren't many ties in the cost, this should be good enough. If there are, add something else to the tuple to help break ties without comparing the arrays.
That means all equal elements will be kept together, which will enable you to dequeue all of them at the same time, giving the impression of having kept only one.
And that equals will actually work, because WrappedArray compares like Scala sequences do.
I don't understand what you mean by that fourth point.
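The remark about equals is easy to check in isolation. The sketch below uses toVector as the content-based wrapper, which is an assumption made here for portability across Scala versions; the answer itself uses WrappedArray:

```scala
import scala.collection.mutable

val a = Array(1, 2, 3)
val b = Array(1, 2, 3)

// Plain JVM arrays compare by reference, so two equal-looking arrays differ.
val sameReference = a == b
// Element-by-element comparison has to be requested explicitly...
val sameContents = a.sameElements(b)
// ...or obtained by wrapping the arrays in a real Scala sequence.
val sameAsSeqs = a.toVector == b.toVector

// With a content-based key, a cache filled via `a` is also hit via `b`.
val cache = mutable.HashMap(a.toVector -> 42)
val hit = cache.get(b.toVector)
```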

Why is filter in front of foldLeft slow in Scala?

I wrote an answer to the first Project Euler question:
Add all the natural numbers below one thousand that are multiples of 3 or 5.
The first thing that came to me was:
(1 until 1000).filter(i => (i % 3 == 0 || i % 5 == 0)).foldLeft(0)(_ + _)
but it's slow (it takes 125 ms), so I rewrote it, simply thinking of 'another way' versus 'the faster way'
(1 until 1000).foldLeft(0){
(total, x) =>
x match {
case i if (i % 3 == 0 || i % 5 ==0) => i + total // Add
case _ => total //skip
}
}
This is much faster (only 2 ms). Why? I'm guessing the second version uses only the Range generator and doesn't manifest a fully realized collection in any way, doing it all in one pass, both faster and with less memory. Am I right?
Here is the code on IdeOne: http://ideone.com/GbKlP
The problem, as others have said, is that filter creates a new collection. The alternative withFilter doesn't, but that doesn't have a foldLeft. Also, using .view, .iterator or .toStream would all avoid creating the new collection in various ways, but they are all slower here than the first method you used, which I thought somewhat strange at first.
But, then... See, 1 until 1000 is a Range, whose size is actually very small, because it doesn't store each element. Also, Range's foreach is extremely optimized, and is even specialized, which is not the case of any of the other collections. Since foldLeft is implemented as a foreach, as long as you stay with a Range you get to enjoy its optimized methods.
(_: Range).foreach:
@inline final override def foreach[@specialized(Unit) U](f: Int => U) {
if (length > 0) {
val last = this.last
var i = start
while (i != last) {
f(i)
i += step
}
f(i)
}
}
(_: Range).view.foreach
def foreach[U](f: A => U): Unit =
iterator.foreach(f)
(_: Range).view.iterator
override def iterator: Iterator[A] = new Elements(0, length)
protected class Elements(start: Int, end: Int) extends BufferedIterator[A] with Serializable {
private var i = start
def hasNext: Boolean = i < end
def next: A =
if (i < end) {
val x = self(i)
i += 1
x
} else Iterator.empty.next
def head =
if (i < end) self(i) else Iterator.empty.next
/** $super
* '''Note:''' `drop` is overridden to enable fast searching in the middle of indexed sequences.
*/
override def drop(n: Int): Iterator[A] =
if (n > 0) new Elements(i + n, end) else this
/** $super
* '''Note:''' `take` is overridden to be symmetric to `drop`.
*/
override def take(n: Int): Iterator[A] =
if (n <= 0) Iterator.empty.buffered
else if (i + n < end) new Elements(i, i + n)
else this
}
(_: Range).view.iterator.foreach
def foreach[U](f: A => U) { while (hasNext) f(next()) }
And that, of course, doesn't even count the filter between view and foldLeft:
override def filter(p: A => Boolean): This = newFiltered(p).asInstanceOf[This]
protected def newFiltered(p: A => Boolean): Transformed[A] = new Filtered { val pred = p }
trait Filtered extends Transformed[A] {
protected[this] val pred: A => Boolean
override def foreach[U](f: A => U) {
for (x <- self)
if (pred(x)) f(x)
}
override def stringPrefix = self.stringPrefix+"F"
}
Try making the collection lazy first, so
(1 until 1000).view.filter...
instead of
(1 until 1000).filter...
That should avoid the cost of building an intermediate collection. You might also get better performance from using sum instead of foldLeft(0)(_ + _), it's always possible that some collection type might have a more efficient way to sum numbers. If not, it's still cleaner and more declarative...
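For reference, the variants discussed so far side by side. This is just a sketch to show that they agree on the result; the helper name multipleOf3Or5 is made up, and relative timings should be measured, not read off this snippet:

```scala
def multipleOf3Or5(i: Int): Boolean = i % 3 == 0 || i % 5 == 0

// Strict filter: builds an intermediate collection before summing.
val strict = (1 until 1000).filter(multipleOf3Or5).sum

// View: the filter is applied lazily while summing.
val viaView = (1 until 1000).view.filter(multipleOf3Or5).sum

// Single pass over the Range, filtering inside the fold.
val singlePass = (1 until 1000).foldLeft(0) { (total, i) =>
  if (multipleOf3Or5(i)) total + i else total
}
```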
Looking through the code, it looks like filter does build a new Seq on which the foldLeft is called. The second skips that bit. It's not so much the memory, although that can't but help, but that the filtered collection is never built at all. All that work is never done.
Range uses TraversableLike.filter, which looks like this:
def filter(p: A => Boolean): Repr = {
val b = newBuilder
for (x <- this)
if (p(x)) b += x
b.result
}
I think it's the += on line 4 that's the difference. Filtering in foldLeft eliminates it.
filter creates a whole new sequence on which then foldLeft is called. Try:
(1 until 1000).view.filter(i => (i % 3 == 0 || i % 5 == 0)).reduceLeft(_+_)
This will prevent said effect, merely wrapping the original thing. Exchanging foldLeft with reduceLeft is only cosmetic (in this case).
Now the challenge is, can you think of a yet more efficient way? Not that your solution is too slow in this case, but how well does it scale? What if instead of 1000, it was 1000000000? There is a solution that could compute the latter case just as quickly as the former.
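One possible answer to that challenge, offered as a sketch rather than as the solution the author had in mind: the multiples of k below a limit form an arithmetic series, so their sum has a closed form, and inclusion-exclusion removes the multiples of 15 that were counted twice. The function names are invented for this example:

```scala
// Sum of all positive multiples of k that are strictly below limit:
// k * (1 + 2 + ... + m) with m = (limit - 1) / k.
def sumMultiplesBelow(k: Long, limit: Long): Long = {
  val m = (limit - 1) / k
  k * m * (m + 1) / 2
}

// Multiples of 3 plus multiples of 5, minus multiples of 15 (counted twice).
def sumOf3Or5Below(limit: Long): Long =
  sumMultiplesBelow(3, limit) + sumMultiplesBelow(5, limit) - sumMultiplesBelow(15, limit)

val small = sumOf3Or5Below(1000L)
val large = sumOf3Or5Below(1000000000L) // same constant amount of work
```

The intermediate products fit in a Long for limits of this size, but they would overflow for much larger limits.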

Resources