What is the performance cost of converting between seqs and vectors?

Many core Clojure functions return lazy sequences, even when vectors are passed into them. For example, if I had a vector of numbers, and wanted to filter them based on some predicate but get another vector back, I'd have to do something like this:
(into [] (filter my-pred my-vec))
Or:
(vec (filter my-pred my-vec))
Though I'm not sure if there's any meaningful difference between the two.
Is this operation expensive, or do you get it effectively for free, as when converting to/from a transient?
I understand that the seq is lazy, so nothing will actually get calculated until you plop it into the output vector, but is there an overhead to converting between a seq and a concrete collection? Can it be characterized in terms of big-O, or does big-O not make sense here? What about the other way, when converting from a vector to a seq?

There's an FAQ on the Clojure site about good use cases for transducers, which can be handy for more complex transformations (more than just filtering, or when the predicate is fairly complex). Otherwise you can use filterv, which is in the core library and can be assumed to do any reasonable optimization for you.

TL;DR Don't worry about it
Longer version:
The main cost is memory allocation/GC. Usually this is trivial. If you have too much data to fit simultaneously in RAM, the lazy version can save you.
If you want to measure toy problems, you can experiment with the Criterium library. Try powers of 10 from 10^2 up to 10^9.
(crit/quick-bench (println :sum (reduce + 0 (into [] (range (Math/pow 10 N))))))
for N=2..9 with and without the (into [] ...) part.
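If it helps to see the same distinction outside Clojure, here is a rough Python 3 analogy (a sketch; the names are mine, not from the thread): filter returns a lazy iterator, and realizing it into a concrete list is a single O(n) walk over the results, which is essentially what you pay for the (into [] ...) step, plus the allocation/GC mentioned above.

# Lazy: nothing has been filtered yet, roughly like a lazy seq.
lazy = filter(lambda x: x % 2 == 0, range(1000000))

# Realizing it into a concrete collection walks and copies every element,
# so the conversion is O(n) in the size of the result, plus allocation/GC.
realized = list(lazy)
print(len(realized))  # 500000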


When and why to use hash tables in CL instead of a-lists?

I believe Common Lisp is the only language I have worked with that has a variety of extremely useful data structures.
The a-list being the most important one to me. I use it all the time.
When and why do you (or should you) use hash tables?
My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists. Which honestly, I find weird considering almost everything is a list.
Maybe I am missing something in my inexperience?
The hash table is very useful when you have to access a large set of values through a key, since the complexity of this operation with a hash table is O(1), while the complexity of the operation using an a-list is O(n), where n is the length of the list.
So I use one when I need to access a set of values multiple times and the set has more than a few elements.
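The same asymptotic difference shows up in any language that has both structures. As a rough illustration (a Python sketch; the variable names are mine), looking up a late key in a 1000-entry association-list-style structure scans linearly, while the hash-table-style dict goes straight to it:

import timeit

pairs = [(k, k * k) for k in range(1000)]   # association-list style: a list of pairs
table = dict(pairs)                         # hash-table style

# Worst-case key for the linear scan: O(n) per lookup vs O(1) average for the dict.
print(timeit.timeit(lambda: next(v for k, v in pairs if k == 999), number=10000))
print(timeit.timeit(lambda: table[999], number=10000))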
There are a lot of assumptions to address in your question:
I believe Common Lisp is the only language I have worked with that has a variety of extremely useful data structures.
I don't think this is particularly true; the standard libraries of popular languages (C++, Java, Rust, Python) are filled with a lot of useful data structures too.
When and why do you (or should you) use hash tables?
Data structures come with costs in terms of memory and processor usage: a list must be searched linearly to find an element, whereas a hash table has a constant lookup cost. For small lists, however, the linear search might be faster than the hashed lookup. There are also other criteria, such as: do I want to access the data concurrently? A list can be manipulated in a purely functional way, making data sharing across threads easier than with a hash table (though a hash table can be associated with a mutex, etc.).
My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists. Which honestly, I find weird considering almost everything is a list.
The source code of Lisp programs is made mostly of lists and symbols, even though there is no such restriction. But at runtime, CL has a lot of different types that are not at all related to lists: bignums, floating-point numbers, rational numbers, complex numbers, vectors, arrays, packages, symbols, strings, classes and structures, hash-tables, readtables, functions, etc. You can model a lot of data at runtime by putting it in lists, which works well in a lot of cases, but lists are by far not the only types available.
Just to emphasize a little bit, when you write:
(vector 0 1 2)
This might look like a list in your code, but at runtime the value really is a different kind of object, a vector. Do not be confused by how things are expressed in code and how they are represented during code execution.
If you don't use it already, I suggest installing and using the Alexandria library (see https://alexandria.common-lisp.dev/). It has useful functions for converting hash-tables to and from alists or plists.
More generally, I think it is important to structure your libraries and programs in a way that hides implementation details: you define a function make-person and accessors person-age, person-name, etc., as well as other user-facing functions. The actual implementation can use hash tables, lists, etc., but this is not a concern that should be exposed, because exposing it is a risk: you won't be able to easily change your mind later if you find out that the performance is bad, or if you want to add a cache, use a database, etc.
I find however that CL is good at making nice interfaces that do not come with too much accidental complexity.
My reluctance to using them is that, unlike the other data structures, hashtables in CL are not visible lists.
They are definitely not lists, and indeed their contents are not visible by default either:
#<HASH-TABLE :TEST EQL :COUNT 1 {100F4BA883}>
this doesn't show what's inside the hash-table. During development it will require more steps to inspect what's inside (inspect, describe, alexandria:hash-table-alist, defining a non-portable print-object method…).
serapeum:dict
I very much like serapeum:dict, coupled with (serapeum:toggle-pretty-print-hash-table) (see also the Cookbook).
CL-USER> (serapeum:dict :a 1 :b 2 :c 3)
;; => #<HASH-TABLE :TEST EQUAL :COUNT 3 {100F6012D3}>
CL-USER> (serapeum:toggle-pretty-print-hash-table)
;; print the above HT again:
CL-USER> **
(SERAPEUM:DICT
:A 1
:B 2
:C 3
)
Not only is it printed readably, but it lets you create the hash-table with initial elements at the same time (unlike make-hash-table), and you can read it back in. It's even easy to save such a structure to a file.
Serapeum is a solid library.
Now, use hash-tables more easily.
When to use a hash-table: You need to do fast (approximately "constant time") look-ups of data.
When to use an a-list: You have a need to dynamically shadow data you pass on to functions.
If neither of these obviously applies, you have to make a choice, then possibly benchmark your choice, and then evaluate whether rewriting it using the other choice would be better. In some experimentation that someone did well over a decade ago, the break-even point between an a-list and a hash-map in most Common Lisp implementations was somewhere in the region of 5 to 20 keys.
However, if you have a need to "shadow" bindings, for functions you call, an a-list does provide that "for free", and a hash-map does not. So if that is something that your code does a lot of, an a-list MAY be the better choice.
* (defun lookup (alist key) (assoc key alist))
LOOKUP
* (lookup '((key1 . value1) (key2 . value2)) 'key1)
(KEY1 . VALUE1)
* (lookup '((key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE2)
* (lookup '((key2 . value3) (key1 . value1) (key2 . value2)) 'key2)
(KEY2 . VALUE3)

Lazy Evaluation: Why is it faster, advantages vs disadvantages, mechanics (why it uses less cpu; examples?) and simple proof of concept examples [closed]

Lazy evaluation is said to be a way of delaying a process until the first time it is needed. This tends to avoid repeated evaluations, and that's why I would imagine it performs a lot faster.
Functional languages like Haskell (and JavaScript?) have this functionality built in.
However, I don't understand how and why other 'normal' approaches (that is, the same functionality but without lazy evaluation) are slower. How and why do these other approaches do repeated evaluations? Can someone elaborate on this by giving simple examples and explaining the mechanics of each approach?
Also, according to Wikipedia page about lazy evaluation these are said to be the advantages of this approach:
1. Performance increases by avoiding needless calculations, and error conditions in evaluating compound expressions
2. The ability to construct potentially infinite data structures
3. The ability to define control flow (structures) as abstractions instead of primitives
However, can't we just control the calculations needed and avoid repeating the same ones? (1)
We can use e.g. a linked list to create an infinite data structure (2)
Can we not do (3) already? We can define classes/templates/objects and use those instead of primitives (e.g. in JavaScript).
Additionally, it seems to me (at least from the cases I have seen) that lazy evaluation goes hand in hand with recursion and with the notions of 'head' and 'tail' (among others). Surely there are cases where recursion is useful, but is lazy evaluation something more than that? More than a recursive approach to solving a problem? Streamjs is a JavaScript library that uses recursion along with some other simple operations (head, tail, etc.) to perform lazy evaluation.
It seems I can't get my head around it...
Thanks in advance for any contribution.
I'll show examples in both Python 2.7 and Haskell.
Say, for example, you wanted to do a really inefficient sum of all the numbers from 0 to 10,000,000. You could do this with a for loop in Python as
total = 0
for i in range(10000000):
    total += i
print total
On my computer, this takes about 1.3s to execute. If instead, I changed range to xrange (the generator form of range, lazily produces a sequence of numbers), it takes 1.2s, only slightly faster. However, if I check the memory used (using the memory_profiler package), the version with range uses about 155MB of RAM, while the xrange version uses only 1MB of RAM (both numbers not including the ~11MB Python uses). This is an incredibly dramatic difference, and we can see where it comes from with this tool as well:
Mem usage    Increment    Line Contents
=======================================
 10.875 MiB    0.004 MiB  total = 0
165.926 MiB  155.051 MiB  for i in range(10000000):
165.926 MiB    0.000 MiB      total += i
                          return total
This says that before we started we were using 10.875MB, total = 0 added 0.004MB, and then for i in range(10000000): added 155.051MB when it generated the entire list of numbers [0..9999999]. If we compare to the xrange version:
Mem usage    Increment    Line Contents
=======================================
 11.000 MiB    0.004 MiB  total = 0
 11.109 MiB    0.109 MiB  for i in xrange(10000000):
 11.109 MiB    0.000 MiB      total += i
                          return total
So we started with 11MB and for i in xrange(10000000): added only 0.109MB. This is a huge memory savings by only adding a single letter to the code. While this example is fairly contrived, it shows how not computing a whole list until the element is needed can make things a lot more memory efficient.
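If you want to see roughly what xrange is doing under the hood, you can write the lazy sequence yourself with a generator (a sketch; lazy_range is my name, not a standard function). Each value is produced on demand with yield, so the full list never exists in memory:

def lazy_range(n):
    # Produce 0, 1, ..., n-1 one value at a time instead of building a list.
    i = 0
    while i < n:
        yield i
        i += 1

total = 0
for i in lazy_range(10000000):
    total += i
print(total)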
Python has iterators and generators which act as a sort of "lazy" programming for when you need to yield sequences of data (although there's nothing stopping you from using them for single values), but Haskell has laziness built into every value in the language, even user-defined ones. This lets you take advantage of things like data structures that won't fit in memory without having to program complicated ways around that fact. The canonical example would be the fibonacci sequence:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
which very elegantly expresses this famous sequence to define a recursive infinite list generating all fibonacci numbers. It's CPU efficient because all values are cached, so each element only has to be computed once (compared to a naive recursive implementation)1, but if you calculate too many elements your computer will eventually run out of RAM because you're now storing this huge list of numbers. This is an example where lazy programming lets you have CPU efficiency, but not RAM efficiency. There is a way around this, though. If you were to write
fib :: Int -> Integer
fib n = let fibs = 1 : 1 : zipWith (+) fibs (tail fibs) in fibs !! n
then this runs in near-constant memory, and does so very quickly, but memoization is lost as subsequent calls to fib have to recompute fibs.
A more complex example can be found here, where the author shows how to use lazy programming and recursion in Haskell to perform dynamic programming with arrays, a feat that most initially think is very difficult and requires mutation, but Haskell manages to do very easily with "tying the knot" style recursion. It results in both CPU and RAM efficiency, and does so in fewer lines than I'd expect in C/C++.
All this being said, there are plenty of cases where lazy programming is annoying. Often you can build up huge numbers of thunks instead of computing things as you go (I'm looking at you, foldl), and some strictness has to be introduced to attain efficiency. It also bites a lot of people with IO, when you read a file into a string as a thunk, close the file, and then try to operate on that string. It's only after the file is closed that the thunk gets evaluated, causing an IO error that crashes your program. As with anything, lazy programming is not without its flaws, gotchas, and pitfalls. It takes time to learn how to work with it well and to know what its limitations are.
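Python's generators have a rough analog of that lazy-IO pitfall, which may make it more concrete if you don't read Haskell (a sketch, assuming a file data.txt exists): the generator below is only a recipe, so by the time it is consumed, the with block has already closed the file.

with open("data.txt") as f:
    lines = (line.upper() for line in f)   # nothing has been read yet

# The file is closed here, but the generator still refers to it.
for line in lines:                         # raises ValueError: I/O operation on closed file
    print(line)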
1) By "naive recursive implementation", I mean implementing the fibonacci sequence as
fib :: Integer -> Integer
fib 0 = 1
fib 1 = 1
fib n = fib (n-1) + fib (n-2)
With this implementation, you can see the mathematical definition very clearly; it's very much in the style of inductive proofs: you show your base cases and then the general case. However, if I call fib 5, this will "expand" into something like
fib 5 = fib 4 + fib 3
= fib 3 + fib 2 + fib 2 + fib 1
= fib 2 + fib 1 + fib 1 + fib 0 + fib 1 + fib 0 + fib 1
= fib 1 + fib 0 + fib 1 + fib 1 + fib 0 + fib 1 + fib 0 + fib 1
= 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1
= 8
Instead, we'd like to share some of those computations, so that fib 3 only gets computed once, fib 2 only gets computed once, and so on.
By using a recursively defined list in Haskell, we can avoid this. Internally, this list is represented something like this:
fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
     = 1 : 1 : zipWith (+) (f1:f2:fs) (f2:fs)
     = 1 : 1 : 2 : zipWith (+) (f2:f3:fs) (f3:fs)
     = 1 : 1 : 2 : 3 : zipWith (+) (f3:f4:fs) (f4:fs)

(Here f1, f2, f3, … are not copies: as the original diagram's arrows indicated, the arguments to zipWith are pointers back into the part of fibs that has already been produced.)
So hopefully you can see the pattern forming here: as the list is built, it keeps pointers back to the last two elements generated in order to compute the next element. This means that for the nth element computed, there are n-2 additions performed. Even for the naive fib 5, you can see that more additions are performed than that, and the number of additions will continue to grow exponentially. This definition is made possible through laziness and recursion, letting us turn an O(2^n) algorithm into an O(n) algorithm, but we have to give up RAM to do so. If this is defined at the top level, then values are cached for the lifetime of the program. It does mean that if you need to refer to the 1000th element repeatedly, you don't have to recompute it, just index it.
On the other hand, the definition
fib :: Int -> Integer
fib n =
    let fibs = 1 : 1 : zipWith (+) fibs (tail fibs)
    in  fibs !! n
uses a local copy of fibs every time fib is called. We don't get caching between calls to fib, but we do get local caching, leaving our complexity O(n). Additionally, GHC is smart enough to know that we don't have to keep the beginning of the list around after we've used it to calculate the next element, so as we traverse fibs looking for the nth element, it only needs to hold on to 2-3 elements and a thunk pointing at the next element. This saves us RAM while computing it, and since it isn't defined at a global level it doesn't eat up RAM over the lifetime of the program. It's a tradeoff between when we want to spend RAM and CPU cycles, and different approaches are better for different situations. These techniques are applicable to much of Haskell programming in general, not just for this sequence!
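If the Haskell syntax is getting in the way, the same sharing idea can be sketched in Python with explicit memoization (this illustrates caching in general, not how GHC evaluates the lazy list): each fib(k) is computed once and stored, so the call tree collapses from O(2^n) calls to O(n), at the cost of O(n) cached results.

from functools import lru_cache

@lru_cache(maxsize=None)
def fib(n):
    # Same 1, 1, 2, 3, ... definition as above; the cache provides the sharing.
    if n < 2:
        return 1
    return fib(n - 1) + fib(n - 2)

print(fib(90))   # answers quickly; the uncached exponential version would effectively never finish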
Lazy evaluation is not, in general, faster. When it's said that lazy evaluation is more efficient, it is because if you consider the lambda calculus (which is essentially what your Haskell programs are once the compiler finishes desugaring them) as a system of terms and reduction rules, then applying those rules in the order specified by a call-by-name-with-sharing evaluation policy always performs the same number of reductions or fewer than applying them in the order specified by call-by-value evaluation.
The reason that this theoretical result does not make lazy evaluation faster in general is that the translation to a linear sequential machine model with a memory access bottleneck tends to make all the reductions performed much more expensive! Initial attempts at implementing this model on computers led to programs that executed orders of magnitude more slowly than typical eagerly-evaluating language implementations. It has taken a lot of research and engineering into techniques for implementing lazy evaluation efficiently to get Haskell performance to where it is today. And the fastest Haskell programs take advantage of a form of static analysis called "strictness analysis" which attempts to determine at compile time which expressions will always be needed so that they can be evaluated eagerly rather than lazily.
There are still some cases where straightforward implementations of algorithms will execute faster in Haskell due to only evaluating terms that are needed for the result, but even eager languages always have some facility for evaluating some expressions by need. Conditionals and short-circuiting boolean expressions are ubiquitous examples, and in many eager languages, one can also delay evaluation by wrapping an expression in an anonymous function or some other sort of delaying form. So you can typically use these mechanisms (or even more awkward rewrites) to avoid evaluating expensive things that won't be necessary in an eager language.
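For example (a Python sketch, with made-up names), the usual trick in an eager language is to wrap the expression in a zero-argument function, a thunk, and only call it when the value is actually needed:

def expensive():
    print("computing...")
    return 42

eager = expensive()          # evaluated immediately, even if the result is never used

thunk = lambda: expensive()  # not evaluated yet
need_it = False
if need_it:
    value = thunk()          # evaluated here, only on demand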
The real advantage of Haskell's lazy evaluation is not a performance-related one. Haskell makes it easier to pull expressions apart, re-combine them in different ways, and generally reason about code as if it were a system of mathematical equations instead of being a sequentially-evaluated set of machine instructions. By not specifying any evaluation order, it forced the developers of the language to avoid side-effects that rely on a simple evaluation ordering, such as mutation or IO. This in turn led to a host of elegant abstractions that are generally useful and might not have been developed into usability otherwise.
The state of Haskell is now such that you can write high-level, elegant algorithms that make better re-use of existing higher-order functions and data structures than in nearly any other high-level typed language. And once you become familiar with the costs and benefits of lazy evaluation and how to control when it occurs, you can ensure that the elegant code also performs very well. But getting the elegant code to a state of high performance is not necessarily automatic and may require a bit more thought than in a similar but eagerly-evaluated language.
The concept of "lazy evaluation" is only about 1 thing, and only about that 1 thing:
The ability to postpone evaluation of something until needed
That's it.
Everything else in that wikipedia article follows from it.
Infinite data structures? Not a problem. We'll just make sure we don't actually figure out what the next element is until you actually ask for it. For instance, a sequence where the next value is always the current value plus 1 is infinite. If you create a list containing all of those values, it's going to fill your computer's available memory. If you only figure out the next value when it is asked for, not so much.
Needless calculations? Sure. You can return an object containing a lot of properties that, when asked, will provide some value. If you don't ask (i.e. you never inspect the value of a given property), the calculation necessary to figure out that value will never be done.
Control flow ... ? Not at all sure what that is about.
The purpose of lazy evaluation of something is exactly as I stated to begin with, to avoid evaluating something until you actually need it. Be it the next value of something, the value of a property, whatever, adding support for lazy evaluation might conserve CPU cycles.
What would the alternative be?
I want to return an object to the calling code, containing any number of properties, some of which might be expensive to calculate. Without lazy evaluation, I would have to calculate the values of all those properties either:
1. Before constructing the object
2. After constructing the object, the first time you inspect a property
3. After constructing the object, every time you inspect that property
With lazy evaluation you usually end up with number 2. You postpone evaluating the value of that property until some code inspects it. Note that you might cache the value once evaluated, which would save CPU cycles when inspecting the same property more than once, but that is caching, not quite the same, but in the same line of work: optimizations.
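As a concrete sketch of option 2 plus caching (Python 3.8+; the class and names are mine): functools.cached_property delays the computation until the first access and then stores the result, so later reads are free.

from functools import cached_property

class Report:
    def __init__(self, rows):
        self.rows = rows

    @cached_property
    def total(self):
        # Not computed when the object is constructed; computed on first
        # access and then cached for subsequent accesses.
        print("summing...")
        return sum(self.rows)

r = Report(range(1000000))   # cheap: nothing summed yet
print(r.total)               # first access triggers the computation
print(r.total)               # reuses the cached value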

Haskell function nub inefficient

I'm confused by the implementation of the 'nub' (select unique values) function in the Haskell standard library Data.List. The GHC implementation is
nub l = nub' l []
  where
    nub' [] _ = []
    nub' (x:xs) ls
      | x `elem` ls = nub' xs ls
      | otherwise   = x : nub' xs (x:ls)
As far as I can tell, this has a worst-case time complexity of O(n^2), since for a list of unique values it has to compare them all once to see that they are in fact unique.
If one used a hash table, the complexity could be reduced to O(n) for building the table + O(1) for checking each value against previous values in the hash table. Granted, this would not produce an ordered list but that would also be possible in O(n log n) using GHC's own ordered Data.Map, if that is necessary.
Why choose such an inefficient implementation for an important library function? I understand efficiency is not a main concern in Haskell but at least the standard library could make an effort to choose the (asymptotically) best data structure for the job.
Efficiency is quite a concern in Haskell; after all, the language performs on par with Java and beats it in terms of memory consumption, but of course it's not C.
The answer to your question is pretty simple: the Prelude's nub requires only an Eq constraint, while any implementation based on Map or Set would also require either an Ord or Hashable.
You're absolutely correct - nub is an O(n^2) algorithm. However, there are still reasons why you might want to use it instead of using a hashmap:
for small lists it still might be faster
nub only requires the Eq constraint; by comparison Data.Map requires an Ord constraint on keys and Data.HashMap requires a key type with both Hashable and Ord type classes
it's lazy - you don't have to run through the entire input list to start getting results
Edit: Slight correction on the third point -- you don't have to process the entire list to start getting results; you'll still have to examine every element of the input list (so nub won't work on infinite lists), but you'll start returning results as soon as you find a unique element.
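For comparison, here is the hash-based approach the question describes, sketched in Python rather than Haskell (this nub is my function, not Data.List.nub): a set gives O(1) average membership checks, so the whole pass is O(n), but it requires hashable elements, which is the rough analog of demanding Hashable or Ord rather than just Eq. Written as a generator it also keeps the incremental behaviour from the third point above.

def nub(xs):
    seen = set()
    for x in xs:
        if x not in seen:       # O(1) average lookup instead of scanning earlier results
            seen.add(x)
            yield x             # each unique element is produced as soon as it is found

print(list(nub([3, 1, 3, 2, 1])))   # [3, 1, 2]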
https://groups.google.com/forum/m/#!msg/haskell-cafe/4UJBbwVEacg/ieMzlWHUT_IJ
In my experience, "beginner" Haskell (including Prelude and the bad packages) simply ignores performance in many cases, in favor of simplicity.
Haskell performance is a complex problem to solve, so if you aren't experienced enough to search through Platform or Hackage for alternatives to the simple nub (and especially if your input is in a List just because you haven't thought about alternative structures), then Data.List.nub is likely not your only major performance problem and also you are probably writing code for a toy project where performance doesn't really matter.
You just have to have faith that when you get to building a large (in code or data) project, you will be more experienced and know how to set up your programs more efficiently.
In other words, don't worry about it, and assume that anything in Haskell 98 that comes from Prelude or base is likely to not be the most efficient way to solve a problem.

What is the fastest way to do integer division?

Using Scheme, I need the following function. (All args are natural numbers, i.e. in [0, inf).)
(define safe-div
  (lambda (num denom safe)
    (if (zero? denom)
        safe
        (div num denom))))
However, this function is called quite often and is not performing well enough (speed wise).
Is there a more efficient way of implementing the desired behavior (integer division of num and denom, return safe value if denom is zero)?
Note: I am using Chez Scheme; however, this is being used in a library that imports rnrs only, not full Chez.
For maximum performance, you need to get as close to the silicon as possible. Adding safety checks like this isn't going to do it, unless they get just-in-time compiled into ultra-efficient machine code by the scheme system.
I see two options. One is to create a native (i.e. foreign) implementation in C (or assembly) and invoke that. That might not be compatible with packaging it as a lambda, but then again, the dynamic nature of lambdas leads to notational efficiency but not necessarily runtime efficiency. (Function pointers excepted, there's a reason lambda expressions are not present in C, despite being many years older.) If you go this route, it might be best to take a step back and see if the larger processing of which safe-div is a part should be taken native. There's little point in speeding up the division at the center of a loop if everything around it is still slow.
Assuming that division by zero is expected to be rare, another approach is to just use div and hope its implementation is fast. Yes, this can lead to division by zero, but when it comes to speed, sometimes it is better to beg forgiveness than to ask permission. In other words, skip the checking before the division and just do it. If it fails, the scheme runtime should catch the division by zero fault and you can install an exception handler for it. This leads to slower code in the exceptional case and faster code in the normal case. Hopefully this tradeoff works out to a performance win.
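In Python terms, the "beg forgiveness" version looks like this (a sketch of the idea, not Scheme): no test in the common path, and the rare zero denominator is handled by the exception machinery.

def safe_div(num, denom, safe):
    try:
        return num // denom
    except ZeroDivisionError:
        return safe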
Lastly, depending on what you are dividing by, it might be faster to multiply by the reciprocal than to perform an actual division. This requires fast reciprocal computation or revising earlier computations to yield a reciprocal directly. Since you are dealing with integers, the reciprocal would be stored in fixed-point, which is essentially 2^32 * 1/denom. Multiply this by num and shift right by 32 bits to get the quotient. This works out to a win because more processors these days have single cycle multiply instructions, but division executes in a loop on the chip, which is much slower. This might be overkill for your needs, but could be useful at some point.
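Here is a sketch of the fixed-point reciprocal idea (Python is used only to show the arithmetic; in Python itself // is what you would actually use, and the names are mine): precompute roughly 2^32/denom once, then each division becomes a multiply and a shift. Making this exact for every numerator needs the classic "magic number" analysis (Granlund and Montgomery), so treat it as an illustration rather than a drop-in replacement.

BITS = 32

def make_divider(denom):
    # Fixed-point reciprocal: approximately 2^32 / denom, rounded up.
    recip = ((1 << BITS) + denom - 1) // denom
    def divide(num):
        # Multiply by the reciprocal and shift right by 32 bits.
        # May be off by one for some inputs without the full magic-number analysis.
        return (num * recip) >> BITS
    return divide

div_by_7 = make_divider(7)
print(div_by_7(100))   # 14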

Is Scala functional programming slower than traditional coding?

In one of my first attempts to create functional code, I ran into a performance issue.
I started with a common task - multiply the elements of two arrays and sum up the results:
var first: Array[Float] ...
var second: Array[Float] ...
var sum = 0f
for (ix <- 0 until first.length)
  sum += first(ix) * second(ix)
Here is how I reformed the work:
sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
When I benchmarked the two approaches, the second method takes 40 times as long to complete!
Why does the second method take so much longer? How can I reform the work to be both speed efficient and use functional programming style?
The main reasons why these two examples are so different in speed are:
the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
the faster one doesn't create temporary collections and, thus, avoids extra memory copies.
Let's consider the slower one by parts. First:
first.zip(second)
That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.
map{ case (a,b) => a*b }
Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).
We are not finished, though. Here's the next part:
reduceLeft(_+_)
That one is relatively harmless, except that it still does boxing/unboxing and java.lang.Float creation at each iteration, since reduceLeft receives a Function2, which is parameterized.
Scala 2.8 introduces a feature called specialization which will get rid of a lot of these boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:
sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:
sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
This last one doesn't save much, actually, so I think the non-strictness is being "lost" pretty fast (i.e., one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (i.e., avoids some intermediary results) by default:
sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
This gives a much better result than the former. Better than the foldLeft one, though not by much. Unfortunately, we can't combine zipped with foldLeft because the former doesn't support the latter.
The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but only for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. In my tests, though, Double is actually taking longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.
So, in the end, the problem is a combination of factors, including producing intermediary copies of elements and the way Java (the JVM) handles primitives and generics. Similar code in Haskell using supercompilation would be equal to anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.
I did some variations of this with Scala 2.8. The loop version is as you write but the
functional version is slightly different:
(xs, ys).zipped map (_ * _) reduceLeft(_ + _)
I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested Boxed variants which work on java.lang.Double's instead of primitive Doubles to measure
the effect of primitive type boxing and unboxing. Here is what I got (running Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test).
loopArray 461 437 436 437 435
reduceArray 6573 6544 6718 6828 6554
loopVector 5877 5773 5775 5791 5657
reduceVector 5064 4880 4844 4828 4926
loopArrayBoxed 2627 2551 2569 2537 2546
reduceArrayBoxed 4809 4434 4496 4434 4365
loopVectorBoxed 7577 7450 7456 7463 7432
reduceVectorBoxed 5116 4903 5006 4957 5122
The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.
If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.
If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.
So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.
This is a microbenchmark, and it depends on how the compiler optimizes your code. You have 3 loops composed here,
zip . map . fold
Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.
Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.
There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.
So, in summary:
naive translations may have unexpected copies and inefficiencies
clever FP compilers remove this overhead (but Scala can't yet)
sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it
Don Stewart has a fine answer, but it might not be obvious how going from one loop to three creates a factor of 40 slowdown. I'll add to his answer that Scala compiles to JVM bytecodes, and not only does the Scala compiler not fuse the three loops into one, but the Scala compiler is almost certainly allocating all the intermediate arrays. Notoriously, implementations of the JVM are not designed to handle the allocation rates required by functional languages. Allocation is a significant cost in functional programs, and that's one reason the loop-fusion transformations that Don Stewart and his colleagues have implemented for Haskell are so powerful: they eliminate lots of allocations. When you don't have those transformations, plus you're using an expensive allocator such as is found on a typical JVM, that's where the big slowdown comes from.
Scala is a great vehicle for experimenting with the expressive power of an unusual mix of language ideas: classes, mixins, modules, functions, and so on. But it's a relatively young research language, and it runs on the JVM, so it's unreasonable to expect great performance except on the kind of code that JVMs are good at. If you want to experiment with the mix of language ideas that Scala offers, great—it's a really interesting design—but don't expect the same performance on pure functional code that you'd get with a mature compiler for a functional language, like GHC or MLton.
Is scala functional programming slower than traditional coding?
Not necessarily. Stuff to do with first-class functions, pattern matching, and currying need not be especially slow. But with Scala, more than with other implementations of other functional languages, you really have to watch out for allocations—they can be very expensive.
The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.
That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:
class FastFloatOps(a: Array[Float]) {
  def fastMapOnto(f: Float => Float) = {
    var i = 0
    while (i < a.length) { a(i) = f(a(i)); i += 1 }
    this
  }
  def fastMapWith(b: Array[Float])(f: (Float, Float) => Float) = {
    val len = a.length min b.length
    val c = new Array[Float](len)
    var i = 0
    while (i < len) { c(i) = f(a(i), b(i)); i += 1 }
    c
  }
  def fastReduce(f: (Float, Float) => Float) = {
    if (a.length == 0) Float.NaN
    else {
      var r = a(0)
      var i = 1
      while (i < a.length) { r = f(r, a(i)); i += 1 }
      r
    }
  }
}
implicit def farray2fastfarray(a: Array[Float]) = new FastFloatOps(a)
and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)
Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.
I am not an expert Scala programmer, so there is probably a more efficient method, but what about something like this. This can be tail call optimized, so performance should be OK.
def multiply_and_sum(l1: List[Int], l2: List[Int], sum: Int): Int = {
  if (l1 != Nil && l2 != Nil) {
    multiply_and_sum(l1.tail, l2.tail, sum + (l1.head * l2.head))
  }
  else {
    sum
  }
}
val first = Array(1,2,3,4,5)
val second = Array(6,7,8,9,10)
multiply_and_sum(first.toList, second.toList, 0) //Returns: 130
To answer the question in the title: Simple functional constructs may be slower than imperative on the JVM.
But, if we consider only simple constructs, then we might as well throw out all modern languages and stick with C or assembler. If you look at the programming language shootout, C always wins.
So why choose a modern language? Because it lets you express a cleaner design. Cleaner design leads to performance gains in the overall operation of the application. Even if some low-level methods can be slower. One of my favorite examples is the performance of BuildR vs. Maven. BuildR is written in Ruby, an interpreted, slow, language. Maven is written in Java. A build in BuildR is twice as fast as Maven. This is due mostly to the design of BuildR which is lightweight compared with that of Maven.
Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforesting and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:
let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) |> Array.reduce ( * )
may be rewritten using fold2 so as to avoid all temporary data structures:
let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys
This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.
Here is dbyrnes' solution with arrays (assuming arrays are to be used), just iterating over the index:
def multiplyAndSum(l1: Array[Int], l2: Array[Int]): Int = {
  def productSum(idx: Int, sum: Int): Int =
    if (idx < l1.length)
      productSum(idx + 1, sum + (l1(idx) * l2(idx)))
    else
      sum
  if (l2.length == l1.length)
    productSum(0, 0)
  else
    error("lengths don't fit " + l1.length + " != " + l2.length)
}
val first = (1 to 500).map (_ * 1.1) toArray
val second = (11 to 510).map (_ * 1.2) toArray
def loopi (n: Int) = (1 to n).foreach (dummy => multiplyAndSum (first, second))
println (timed (loopi (100*1000)))
That needs about 1/40 of the time of the list approach. I don't have 2.8 installed, so you have to test @tailrec yourself. :)
