Scala performance question

In the article written by Daniel Korzekwa, he said that the performance of the following code:
list.map(e => e*2).filter(e => e>10)
is much worse than the iterative solution written using Java.
Can anyone explain why? And what is the best solution for such code in Scala (I hope it's not a Java iterative version which is Scala-fied)?

The reason that particular code is slow is because it's working on primitives but it's using generic operations, so the primitives have to be boxed. (This could be improved if List and its ancestors were specialized.) This will probably slow things down by a factor of 5 or so.
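For a sense of what the "iterative solution" being compared against looks like, here is a sketch of a single-pass, unboxed loop over an Array[Int] (illustrative only; the sizes and names are made up):
// Primitive arithmetic throughout: no Integer boxing and no
// intermediate collections. ArrayBuilder.ofInt stores unboxed Ints.
val input = Array.range(0, 1000)
val builder = new scala.collection.mutable.ArrayBuilder.ofInt
var i = 0
while (i < input.length) {
  val x = input(i) * 2
  if (x > 10) builder += x
  i += 1
}
val result: Array[Int] = builder.result()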
Also, algorithmically, those operations are somewhat expensive, because you make a whole list, and then make a whole new list throwing a few components of the intermediate list away. If you did it in one swoop, then you'd be better off. You could do something like:
list collect { case e if e * 2 > 10 => e * 2 }
but what if the calculation e*2 is really expensive? Then you could
(List[Int]() /: list)((ls, e) => { val x = e * 2; if (x > 10) x :: ls else ls })
except that this would appear backwards. (You could reverse it if need be, but that requires creating a new list, which again isn't ideal algorithmically.)
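If forward order is required, the reversed variant might look like this (a sketch; note the extra pass that reverse costs):
((List[Int]() /: list)((ls, e) => { val x = e * 2; if (x > 10) x :: ls else ls })).reverse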
Of course, you have the same sort of algorithmic problems in Java if you're using a singly linked list--your new list will end up backwards, or you have to create it twice, first in reverse and then forwards, or you have to build it with (non-tail) recursion (which is easy in Scala, but inadvisable for this sort of thing in either language since you'll exhaust the stack), or you have to create a mutable list and then pretend afterwards that it's not mutable. (Which, incidentally, you can do in Scala also--see mutable.LinkedList.)
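As a sketch of that last option, a common idiom is to hide a mutable buffer inside and hand back an immutable list:
import scala.collection.mutable.ListBuffer
val buf = ListBuffer[Int]()                              // appends in O(1)
for (e <- list) { val x = e * 2; if (x > 10) buf += x }  // one pass, forward order
val result = buf.toList                                  // cheap conversion to an immutable List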

Basically it traverses the list twice: once to multiply every element by two, and then another time to filter the results.
Think of it as one while loop producing a LinkedList with the intermediate results, and then another loop applying the filter to produce the final results.
This should be faster:
list.view.map(e => e * 2).filter(e => e > 10).force
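To convince yourself that this really makes a single pass, a quick instrumented sketch (the touched counter is purely illustrative):
var touched = 0
val result = list.view.map { e => touched += 1; e * 2 }.filter(e => e > 10).force
// touched == list.length: each element is mapped exactly once, and no
// intermediate list is materialized before force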

The solution lies mostly with the JVM. Though Scala has a workaround in the form of @specialized, that increases the size of any specialized class hugely, and it only solves half the problem -- the other half being the creation of temporary objects.
The JVM actually does a good job optimizing a lot of it, or the performance would be even more terrible, but Java does not require the optimizations that Scala does, so JVM does not provide them. I expect that to change to some extent with the introduction of SAM not-real-closures in Java.
But, in the end, it comes down to balancing the needs. The same while loop that Java and Scala run so much faster than Scala's functional equivalent can be done faster yet in C. Yet, despite what the microbenchmarks tell us, people use Java.

The Scala approach is much more abstract and generic, so it is hard to optimize every single case.
I could imagine that the HotSpot JIT compiler might apply stream and loop fusion to such code in the future if it sees that the intermediate results are not used.
Additionally the Java code just does much more.
If you really just want to mutate a data structure in place, consider transform.
It looks a bit like map but doesn't create a new collection, e. g.:
val array = Array(1,2,3,4,5,6,7,8,9,10).transform(_ * 2)
// array is now WrappedArray(2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
I really hope some additional in-place operations will be added in the future...

To avoid traversing the list twice, I think the for syntax is a nice option here:
val list2 = for(v <- list1; e = v * 2; if e > 10) yield e
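For reference, the compiler desugars this roughly as follows (a sketch of the standard rewriting; withFilter is lazy, so no separate filtered list is built, although the midstream binding does introduce an intermediate collection of pairs on a strict List):
list1.map { v => val e = v * 2; (v, e) }
     .withFilter { case (v, e) => e > 10 }
     .map { case (v, e) => e }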

Rex Kerr correctly states the major problem: operating on immutable lists, the stated piece of code creates intermediate lists in memory. Note that this is not necessarily slower than equivalent Java code; you simply never use immutable data structures in Java.
Wilfried Springer has a nice, Scala idiomatic solution. Using view, no (manipulated) copies of the whole list are created.
Note that using view might not always be ideal. For example, if your first call is a filter that is expected to throw away most of the list, it might be worthwhile to create the shorter version explicitly and use view only after that, in order to improve memory locality for later iterations.

list.filter(e => e*2>10).map(e => e*2)
This attempt filters the list first, so the second traversal runs over fewer elements.
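Note, though, that it computes e * 2 twice for every element that survives the filter. If that computation were expensive, an iterator-based pipeline (a sketch) would do one pass and one multiplication per element:
val result = list.iterator.map(e => e * 2).filter(e => e > 10).toList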

Related

IndexedSeq.last complexity

When working with indexed collections (most often immutable Vectors), I often use coll.last as what I supposed was a convenient shortcut for coll(coll.size - 1). When randomly inspecting my sources, I clicked through to see the last implementation, and the IntelliJ IDE took me to the TraversableLike.last implementation, which traverses all elements to eventually reach the last one.
This was a surprise to me, and I am not sure now what the reason for this is. Is last really implemented this way? Is there some reason preventing last from being implemented efficiently for IndexedSeq (or perhaps for IndexedSeqLike)?
(Scala SDK used is 2.11.4)
IndexedSeq does not override last (it only inherits it from TraversableLike) - the fact that a particular sequence supports indexed access does not necessarily make indexed lookups faster than traversals. However, such optimized implementations are given in IndexedSeqOptimized, which I would expect many implementations to inherit from. In the specific case of Vector, last is overridden explicitly in the class itself.
IndexedSeq has constant access time for an arbitrary element; LinearSeq has linear time. TraversableLike is just the common interface, and you may find that last is overridden inside the IndexedSeqOptimized trait:
A template trait for indexed sequences of type IndexedSeq[A] which optimizes the implementation of several methods under the assumption of fast random access.
def last: A = if (length > 0) this(length - 1) else super.last
You may also find the quick random-access implementation inside Vector.getElem - it uses a tree of arrays with a high branching factor, so apply is usually O(1). It doesn't use IndexedSeqOptimized, but it has its own overridden last:
override /*TraversableLike*/ def last: A = {
  if (isEmpty) throw new UnsupportedOperationException("empty.last")
  apply(length - 1)
}
So it's a little messy inside the Scala collections, which is very common for Scala internals. Anyway, last on IndexedSeqs is O(1) de facto, regardless of this tricky collections architecture.
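As a quick illustration of the practical difference (complexities only; actual timings will vary):
val v = Vector.range(0, 1000000)
val l = List.range(0, 1000000)
v.last // effectively constant time, via Vector's own override
l.last // linear time: the default implementation walks the whole list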
The intricacy of the Scala collections is actually an active topic. A talk (and slides) criticizing Scala's collection framework can be found at Paul Phillips: Scala Collections: Why Not?, and Paul Phillips has been developing his alternative version, std.

How does JIT optimize branching while processing elements of collections? (in Scala)

This is a question about performance of code written in Scala.
Consider the following two code snippets, assume that x is some collection containing ~50 million elements:
def process[T](x: Traversable[T]) = {
  processFirst(x.head)
  x reduce processPair
  processLast(x.last)
}
Versus something like this (assume for now we have some way to determine if we're operating on the first element versus the last element):
def isFirstElement[T](x: T): Boolean = ???
def isLastElement[T](x: T): Boolean = ???
def process[T](x: Traversable[T]) = {
  x reduce { (left, right) =>
    if (isFirstElement(left))
      processFirst(left)
    else if (isLastElement(right))
      processLast(right)
    processPair(left, right)
  }
}
Which approach is faster? And for ~50 million elements, how much faster?
It seems to me that the first example would be faster because there are fewer conditional checks occurring for all but the first and last elements. However for the latter example there is some argument to suggest that the JIT might be clever enough to optimize away those additional head/last conditional checks that would otherwise occur for all but the first/last elements.
Is the JIT clever enough to perform such optimizations? The obvious advantage of the latter approach is that all the business logic can be placed in a single function body, while in the former case the logic must be partitioned into three separate function bodies invoked separately.
** EDIT **
Thanks for all the great responses. While I am leaving the second code snippet above to illustrate its incorrectness, I want to revise the first approach slightly to reflect better the problem I am attempting to solve:
// x is some iterator
def process[T](x: Iterator[T]) = {
  if (x.hasNext) {
    var previous = x.next()
    var current: T = null.asInstanceOf[T]
    processFirst(previous)
    while (x.hasNext) {
      current = x.next()
      processPair(previous, current)
      previous = current
    }
    processLast(previous)
  }
}
While there are no additional checks occurring in the body, there is an additional reference assignment that appears to be unavoidable (previous = current). This is also a much more imperative approach that relies on nullable mutable variables. Implementing this in a functional yet high performance manner would be another exercise for another question.
How does this code snippet stack up against the latter of the two examples above (the single-iteration block approach containing all the branches)? The other thing I realize is that the latter of the two examples is also broken for collections containing fewer than two elements.
If your underlying collection has an inexpensive head and last method (not true for a generic Traversable), and the reduction operations are relatively inexpensive, then the second way takes about 10% longer (maybe a little less) than the first on my machine. (You can use a var to get first, and you can keep updating a second var with the right argument to obtain last, and then do the final operation outside of the loop.)
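A sketch of that var-tracking idea (processFirst, processPair and processLast are the question's hypothetical functions, passed in here so the snippet is self-contained):
def process[T](x: Traversable[T])(processFirst: T => Unit,
                                  processPair: (T, T) => T,
                                  processLast: T => Unit): Unit = {
  var isFirst = true
  var lastSeen = x.head          // assumes a non-empty collection, like the original
  x reduce { (left, right) =>
    if (isFirst) { processFirst(left); isFirst = false }
    lastSeen = right             // always tracks the most recent right-hand argument
    processPair(left, right)
  }
  processLast(lastSeen)          // the final operation, outside the loop
}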
If you have an expensive last (i.e. you have to traverse the whole collection), then the first operation takes about 10% longer (maybe a little more).
Mostly you shouldn't worry too much about it and instead worry more about correctness. For instance, on a 2-element list your second code has a bug (because there is an else instead of a separate test). On a 1-element list, the second code never calls reduce's lambda at all, so it again fails to work.
This argues that you should do it the first way unless you're sure last is really expensive in your case.
Edit: if you switch to a manual reduce-like-operation using an iterator, you might be able to shave off up to about 40% of your time compared to the expensive-last case (e.g. list). For inexpensive last, probably not so much (up to ~20%). (I get these values when operating on lengths of strings, for example.)
First of all, note that, depending on the concrete implementation of Traversable, doing something like x.last may be really expensive. Like, more expensive than all the rest of what's going on here.
Second, I doubt the cost of conditionals themselves is going to be noticeable, even on a 50 million collection, but actually figuring out whether a given element is the first or the last, might again, depending on implementation, get pricey.
Third, JIT will not be able to optimize the conditionals away: if there was a way to do that, you would have been able to write your implementation without conditionals to begin with.
Finally, if you are at a point where it starts looking like an extra if statement might affect performance, you might consider switching to Java or even C. Don't get me wrong, I love Scala, it is a great language, with lots of power and useful features, but being super-fast just isn't one of them.

Is it worth it to rewrite an if statement to avoid branching?

Recently I realized I have been doing too much branching without considering the negative impact it had on performance, so I have made up my mind to learn all about avoiding branches. Here is a more extreme case, in an attempt to make the code have as few branches as possible.
Hence for the code
if (expression)
  A = C; // A and C have to be the same type here obviously
expression can be A == B or Q <= B; it could be anything that resolves to true or false, or I would like to think of it in terms of the result being 1 or 0 here.
I have come up with this non branching version
A += (expression)*(C-A); //Edited with thanks
So my question is: is this a good solution that maximizes efficiency?
If yes why and if not why?
It depends on the compiler, instruction set, optimizer, etc. When you use a boolean expression as an int value, e.g., (A == B) * C, the compiler has to do the compare, and then set some register to 0 or 1 based on the result. Some instruction sets might not have any way to do that other than branching. Generally speaking, it's better to write simple, straightforward code and let the optimizer figure it out, or find a different algorithm that branches less.
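To make the 0/1 step explicit: in a language without an implicit bool-to-int conversion you have to derive the flag yourself. A Scala sketch of a genuinely branchless form of if (x > t) a = c (illustrative only; it assumes t - x does not overflow):
// (t - x) is negative exactly when x > t, so its sign bit, extracted
// with an unsigned shift, yields 1 or 0 without any comparison or branch.
def branchlessAssign(a: Int, c: Int, x: Int, t: Int): Int = {
  val flag = (t - x) >>> 31  // 1 if x > t, else 0
  a + flag * (c - a)         // c when flag == 1, a when flag == 0
}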
Jeez, no, don't do that!
Anyone who "penalize[s] [you] a lot for branching" would hopefully send you packing for using something that awful.
How is it awful, let me count the ways:
There's no guarantee you can multiply a quantity (e.g., C) by a boolean value (e.g., (A==B) yields true or false). Some languages will, some won't.
Anyone casually reading it is going to observe a calculation, not an assignment statement.
You're replacing a comparison, and a conditional branch with two comparisons, two multiplications, a subtraction, and an addition. Seriously non-optimal.
It only works for integral numeric quantities. Try this with a wide variety of floating point numbers, or with an object, and if you're really lucky it will be rejected by the compiler/interpreter/whatever.
You should only ever consider doing this if you have analyzed the runtime properties of the program and determined that there is a frequent branch misprediction here, and that this is causing an actual performance problem. It makes the code much less clear, and it's not obvious that it would be any faster in general (this is something you would also have to measure, under the circumstances you are interested in).
After doing research, I came to the conclusion that when there is a bottleneck, it is good to use a timed profiler, as this kind of code is usually not portable and is mainly used for optimization.
An exact example is one I encountered after reading the following question:
Why is it faster to process a sorted array than an unsorted array?
Testing my C++ code against that example, I found that my implementation was actually slower due to the extra arithmetic.
HOWEVER!
For this case below
if (expression) // branched version
  A += C;
// OR
A += (expression)*(C); // non-branching version
The timings were as follows:
Branched, sorted list: approximately 2 seconds.
Branched, unsorted list: approximately 10 seconds.
My branchless implementation (sorted or unsorted): approximately 3 seconds.
This goes to show that at an unsorted bottleneck, where a trivial branch can simply be replaced by a single multiplication, it is probably worthwhile to consider the implementation I have suggested.
** Once again, this is mainly for the areas deemed to be the bottleneck **

Reusing memory of immutable state in eager evaluation?

I'm studying purely functional languages and currently thinking about some immutable data implementations.
Here is a pseudo code.
List a = [1 .. 10000]
List b = NewListWithoutLastElement a
b
When evaluating b, the list must be copied in an eager/strict implementation of immutable data.
But in this case a is not used anywhere afterwards, so the memory of a could safely be reused to avoid the copying cost.
Furthermore, the programmer could force the compiler to always do this by marking the type List with some keyword meaning must-be-disposed-after-use, which would turn any logic that cannot avoid the copying cost into a compile-time error.
This could yield a huge performance gain, because it could be applied to huge object graphs too.
What do you think? Are there any implementations of this?
This would be possible, but severely limited in scope. Keep in mind that the vast majority of complex values in a functional program will be passed to many functions to extract various properties from them - and, most of the time, those functions are themselves arguments to other functions, which means you cannot make any assumptions about them.
For example:
let map2 f g x = f x, g x
let apply f =
    let a = [1 .. 10000]
    f a
// in another file:
apply (map2 NewListWithoutLastElement NewListWithoutFirstElement)
This is fairly standard in functional code, and there is no way to place a must-be-disposed-after-using attribute on a because no specific location has enough knowledge about the rest of the program. Of course, you could try adding that information to the type system, but type inference on this is decidedly non-trivial (not to mention that types would grow quite large).
Things get even worse when you have compound objects, such as trees, that might share sub-elements between values. Consider this:
let a = binary_tree [ 1; 2; 5; 7; 9 ]
let result_1 = complex_computation_1 (insert a 6)
let result_2 = complex_computation_2 (remove a 5)
In order to allow memory reuse within complex_computation_2, you would need to prove that complex_computation_1 does not alter a, does not store any part of a within result_1, and is done using a by the time complex_computation_2 starts working. While the first two requirements might seem the hardest, keep in mind that this is a pure functional language: the third requirement actually causes a massive performance drop, because complex_computation_1 and complex_computation_2 cannot be run on different threads anymore!
In practice, this is not an issue in the vast majority of functional languages, for three reasons:
They have a garbage collector built specifically for this. It is faster for them to just allocate new memory and reclaim the abandoned one, rather than try to reuse existing memory. In the vast majority of cases, this will be fast enough.
They have data structures that already implement data sharing. For instance, NewListWithoutFirstElement already provides full reuse of the memory of the transformed list without any effort (see the sketch after this list). It's fairly common for functional programmers (and any kind of programmers, really) to determine their use of data structures based on performance considerations, and rewriting a "remove last" algorithm as a "remove first" algorithm is kind of easy.
Lazy evaluation already does something equivalent: a lazy list's tail is initially just a closure that can evaluate the tail if you need to—so there's no memory to be reused. On the other hand, this means that reading an element from b in your example would read one element from a, determine if it's the last, and return it without really requiring storage (a cons cell would probably be allocated somewhere in there, but this happens all the time in functional programming languages and short-lived small objects are perfectly fine with the GC).
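To illustrate the second point, here is what that sharing looks like in Scala (used purely as an example of a language with sharing-friendly immutable lists):
val a = List(1, 2, 3, 4)
val b = a.tail // "NewListWithoutFirstElement": O(1) and copies nothing,
               // because b shares every cons cell with a
val c = 0 :: a // prepending shares too: c's tail is a itself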

Linq to Objects: filtering performance question

I was thinking about the way LINQ computes things and it made me wonder:
If I write
var count = collection.Count(o => o.Category == 3);
Will that perform any differently than:
var count = collection.Where(o => o.Category == 3).Count();
After all, IEnumerable<T>.Where() will return an IEnumerable<T>, which doesn't expose a Count property, so a subsequent Count() would actually have to walk through the items to determine the count, which should cause extra time to be spent on this.
I wrote some quick test code to get some metrics but they seem to beat each other at random. I won't put in the test code here initially, but if anyone requests, I'll get it in.
So, am I missing something?
There won't be a lot in it, really - both forms will iterate over the collection, check the predicate against each item, and count the matches. Both approaches will stream the data - it's not like Where is actually building an in-memory list of all matches, for example.
The first form has one fewer (thin) layer of indirection in, that's all. The main reason for using it (IMO) is for readability/simplicity, rather than performance.
As Jon Skeet says, both techniques will have to do essentially the same thing - enumerate the sequence while conditionally incrementing a counter when the predicate is matched. Any performance differences between the two should be slight: insignificant for almost all use-cases. If there is a token winner though, I would think it should be the first one, since from Reflector it appears that the overload of Count that takes a predicate uses its own foreach to enumerate, rather than the more obvious way of offloading the work by streaming a Where into a parameterless Count as in your second example. This means technique #1 is likely to have two minor performance benefits:
Fewer argument-validation checks (null tests etc.). Technique #2's Count will also check whether its (piped) input is an ICollection or ICollection<T>, which it can't possibly be.
A single constructed enumerator vs. two enumerators piped together (an additional state machine has costs).
There is one minor point in favour of technique #2 though: Where is slightly more sophisticated in constructing an enumerator for the source sequence; it uses a different one for lists and arrays. This may make it more performant in certain scenarios.
Of course, I should reiterate that I might be plain wrong about my analysis - reasoning about performance through static code analysis, especially when the differences are likely to be slight, is not a good idea. There is only one way to find out - measuring the execution times for your specific setup.
FYI, the source I reflected was from .NET 3.5 SP1.
I know what you are thinking here. At least, I think I do; Count() will look to see if Count is available as a property, and will simply return that if so. Otherwise, it has to enumerate the items to get its return value.
The version of Count() which accepts the predicate, though, will always cause the collection to be iterated, since it has to do it to see which ones match.
The above answers make good points. Consider also that if you move into any LINQ-to-X implementation with deferred execution (LINQ to SQL being the primary one), the Expression parameters used in these methods may cause different results.