I have a function that takes a parameter and produces a result. Unfortunately, it takes quite long for the function to produce the result. The function is being called quite often with the same input, that's why it would be convenient if I could cache the results. Something like
let cachedFunction = createCache slowFunction
in (cachedFunction 3.1) + (cachedFunction 4.2) + (cachedFunction 3.1)
I was looking into Data.Array and although the array is lazy, I need to initialize it with a list of pairs (using listArray) - which is impractical . If the 'key' is e.g. the 'Double' type, I cannot initialize it at all, and even if I can theoretically assign an Integer to every possible input, I have several tens of thousands possible inputs and I only actually use a handful. I would need to initialize the array (or, preferably a hash table, as only a handful of resutls will be used) using a function instead of a list.
Update: I am reading the memoization articles and as far as I understand it the MemoTrie could work the way I want. Maybe. Could somebody try to produce the 'cachedFunction'? Prefereably for a slow function that takes 2 Double arguments? Or, alternatively, that takes one Int argument in a domain of ~ [0..1 billion] that wouldn't eat all memory?
Well, there's Data.HashTable. Hash tables don't tend to play nicely with immutable data and referential transparency, though, so I don't think it sees a lot of use.
For a small number of values, stashing them in a search tree (such as Data.Map) would probably be fast enough. If you can put up with doing some mangling of your Doubles, a more robust solution would be to use a trie-like structure, such as Data.IntMap; these have lookup times proportional primarily to key length, and roughly constant in collection size. If Int is too limiting, you can dig around on Hackage to find trie libraries that are more flexible in the type of key used.
As for how to cache the results, I think what you want is usually called "memoization". If you want to compute and memoize results on demand, the gist of the technique is to define an indexed data structure containing all possible results, in such a way that when you ask for a specific result it forces only the computations needed to get the answer you want. Common examples usually involve indexing into a list, but the same principle should apply for any non-strict data structure. As a rule of thumb, non-function values (including infinite recursive data structures) will often be cached by the runtime, but not function results, so the trick is to wrap all of your computations inside a top-level definition that doesn't depend on any arguments.
Edit: MemoTrie example ahoy!
This is a quick and dirty proof of concept; better approaches may exist.
{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE TypeOperators #-}
import Data.MemoTrie
import Data.Binary
import Data.ByteString.Lazy hiding (map)
mangle :: Double -> [Int]
mangle = map fromIntegral . unpack . encode
unmangle :: [Int] -> Double
unmangle = decode . pack . map fromIntegral
instance HasTrie Double where
data Double :->: a = DoubleTrie ([Int] :->: a)
trie f = DoubleTrie $ trie $ f . unmangle
untrie (DoubleTrie t) = untrie t . mangle
slow x
| x < 1 = 1
| otherwise = slow (x / 2) + slow (x / 3)
memoSlow :: Double -> Integer
memoSlow = memo slow
Do note the GHC extensions used by the MemoTrie package; hopefully that isn't a problem. Load it up in GHCi and try calling slow vs. memoSlow with something like (10^6) or (10^7) to see it in action.
Generalizing this to functions taking multiple arguments or whatnot should be fairly straightforward. For further details on using MemoTrie, you might find this blog post by its author helpful.
See memoization
There are a number of tools in GHC's runtime system explicitly to support memoization.
Unfortunately, memoization isn't really a one-size fits all affair, so there are several different approaches that we need to support in order to cope with different user needs.
You may find the original 1999 writeup useful as it includes several implementations as examples:
Stretching the Storage Manager: Weak Pointers and Stable Names in Haskell by Simon Peyton Jones, Simon Marlow, and Conal Elliott
I will add my own solution, which seems to be quite slow as well. First parameter is a function that returns Int32 - which is unique identifier of the parameter. If you want to uniquely identify it by different means (e.g. by 'id'), you have to change the second parameter in H.new to a different hash function. I will try to find out how to use Data.Map and test if I get faster results.
import qualified Data.HashTable as H
import Data.Int
import System.IO.Unsafe
cache :: (a -> Int32) -> (a -> b) -> (a -> b)
cache ident f = unsafePerformIO $ createfunc
where
createfunc = do
storage <- H.new (==) id
return (doit storage)
doit storage = unsafePerformIO . comp
where
comp x = do
look <- H.lookup storage (ident x)
case look of
Just res -> return res
Nothing -> do
result <- return (f x)
H.insert storage (ident x) result
return result
You can write the slow function as a higher order function, returning a function itself. Thus you can do all the preprocessing inside the slow function and the part that is different in each computation in the returned (hopefully fast) function. An example could look like this:
(SML code, but the idea should be clear)
fun computeComplicatedThing (x:float) (y:float) = (* ... some very complicated computation *)
fun computeComplicatedThingFast = computeComplicatedThing 3.14 (* provide x, do computation that needs only x *)
val result1 = computeComplicatedThingFast 2.71 (* provide y, do computation that needs x and y *)
val result2 = computeComplicatedThingFast 2.81
val result3 = computeComplicatedThingFast 2.91
I have several tens of thousands possible inputs and I only actually use a handful. I would need to initialize the array ... using a function instead of a list.
I'd go with listArray (start, end) (map func [start..end])
func doesn't really get called above. Haskell is lazy and creates thunks which will be evaluated when the value is actually required.
When using a normal array you always need to initialize its values. So the work required for creating these thunks is necessary anyhow.
Several tens of thousands is far from a lot. If you'd have trillions then I would suggest to use a hash table yada yada
I don't know haskell specifically, but how about keeping existing answers in some hashed datastructure (might be called a dictionary, or hashmap)? You can wrap your slow function in another function that first check the map and only calls the slow function if it hasn't found an answer.
You could make it fancy by limiting the size of the map to a certain size and when it reaches that, throwing out the least recently used entry. For this you would additionally need to keep a map of key-to-timestamp mappings.
Related
I extracted the following minimal example from my production project. My machine learning project is made up of a linear algebra library, a deep learning library, and an application.
The linear algebra library contains a module for matrices based on storable vectors:
module Matrix where
import Data.Vector.Storable hiding (sum)
data Matrix a = Matrix { rows :: Int, cols :: Int, items :: Vector a } deriving (Eq, Show, Read)
item :: Storable a => Int -> Int -> Matrix a -> a
item i j m = unsafeIndex (items m) $ i * cols m + j
multiply :: Storable a => Num a => Matrix a -> Matrix a -> Matrix a
multiply a b = Matrix (rows a) (cols b) $ generate (rows a * cols b) (f . flip divMod (cols b)) where
f (i, j) = sum $ (\ k -> item i k a * item k j b) <$> [0 .. cols a - 1]
The deep learning library uses the linear algebra library to implement the forward pass through a deep neural network:
module Deep where
import Foreign.Storable
import Matrix
transform :: Storable a => Num a => [Matrix a] -> Matrix a -> Matrix a
transform layers batch = foldr multiply batch layers
And finally the application uses the deep learning library:
import qualified Data.Vector.Storable as VS
import Test.Tasty.Bench
import Matrix
import Deep
main :: IO ()
main = defaultMain [bmultiply] where
bmultiply = bench "bmultiply" $ nf (items . transform layers) batch where
m k l c = Matrix k l $ VS.replicate (k * l) c :: Matrix Double
layers = m 256 256 <$> [0.1, 0.2, 0.3]
batch = m 256 100 0.4
I like the fact that the deep learning library and with some exceptions related to BLAS via FFI also the linear algebra library do not have to worry about concrete types like Float or Double. Unfortunately, this also means that unless specialization takes place, they use boxed values and performance is about 60x worse than it could be (959 ms instead of 16.7 ms).
The only way I have found to get good performance is to force either inlining or specialization throughout the entire call hierarchy via compiler pragmas. This is very annoying because the performance issue that fundamentally should be specific to the multiply function now "infects" the entire code base. Even very high-level functions using multiply via 5 levels of indirection and several intermediate libraries somehow have to "know" about technical specialization issues deep down.
In my actual production code, many more functions are affected than in this minimal example. Forgetting to annotate just a single one of these functions with the right compiler pragma immediately destroys the performance. Additionally, when developing a library, I have no way of knowing which types it will be used with, so specialization pragmas are not an option anyways.
This is particularly unfortunate because all the performance-critical tight loops are wholly contained within the multiply function. The function itself is only called a handful of times and it would not hurt performance if values were only unboxed dynamically whenever multiply is called. In the end, there is really no need for values to be specialized and unboxed inside the high-level machine learning functions. I feel like there should be a way to pass the request for specialization through to the low-level functions while keeping high- and intermediate-level functions polymorphic.
How is this problem typically solved in Haskell? If I develop a library that uses the vector package to generate blazingly-fast code in tight loops, how do I pass that performance on to users of my library without losing all polymorphism or forcing everything to be inlined?
Is there a way to pay the price for polymorphism (in the form of boxing) only within the high-level functions and specialize and unbox only at the boundary to the functions that need it, rather than having specialization "infect" the entire call hierarchy?
If you browse the source for, say, the vector package, you'll find that nearly every function has an INLINABLE or INLINE pragma, whether the function is part of the low-level, performance critical core or part of a high-level generic interface. You'll see something similar if you look at lens or hmatrix, etc.
So, the short answer is: no, the only way to get good performance with your current design will be to infect the entire call hierarchy with pragmas. The best way to avoid missing a pragma and tanking performance will be to have an exhaustive set of benchmarks that can detect performance regressions.
There are a few compiler flags that might be helpful. The flag -fexpose-all-unfoldings makes sure that inlinable versions of all functions find their way into the interface files, while the flag -fspecialise-aggressively looks for any opportunity to specialize those functions. Together, they are kind of like turning on INLINE for every function. This probably isn't a good permanent solution, but it might be useful during development or as a sanity check to get some baseline performance numbers.
I have a program which traverses an expression tree that does algebra on probability distributions, either sampling or computing the resulting distribution.
I have two implementations computing the distribution: one (computeDistribution) nicely reusable with monad transformers and one (simpleDistribution) where I concretize everything by hand. I would like to not concretize everything by hand, since that would be code duplication between the sampling and computing code.
I also have two data representations:
type Measure a = [(a, Rational)]
-- data Distribution a = Distribution (Measure a) deriving Show
newtype Distribution a = Distribution (Measure a) deriving Show
When I use the data version with the reusable code, computing the distribution of 20d2 (ghc -O3 program.hs; time ./program 20 > /dev/null) takes about one second, which seems way too long. Pick higher values of n at your own peril.
When I use the hand-concretized code, or I use the newtype representation with either implementation, computing 20d2 (time ./program 20 s > /dev/null) takes the blink of an eye.
Why?
How can I find out why?
My knowledge of how Haskell is executed is almost nil. I gather there's a graph of thunks in basically the same shape as the program, but that's about all I know.
I figure with newtype the representation of Distribution is the same as that of Measure, i.e. it's just a list, whereas with the data version each Distribution is kinda' like a single-field record, except with a pointer to the contained list, and so the data version has to perform more allocations. Is this true? If true, is this enough to explain the performance difference?
I'm new to working with monad transformer stacks. Consider the Let and Uniform cases in simpleDistribution — do they do the same as the walkTree-based implementation? How do I tell?
Here's my program. Note that Uniform n corresponds to rolling an n-sided die (in case the unary-ness was surprising).
Update: based on comments I simplified my program by removing everything not contributing to the performance gap. I made two semantic changes: probabilities are now denormalized and all wonky and wrong, and the simplification step is gone. But the essential shape of my program is still there. (See question edit history for the non-simplified program.)
Update 2: I made further simplifications, reducing Distribution down to the list monad with a small twist, removing everything to do with probabilities, and shortening the names. I still observe large performance differences when using data but not newtype.
import Control.Monad (liftM2)
import Control.Monad.Trans (lift)
import Control.Monad.Reader (ReaderT, runReaderT)
import System.Environment (getArgs)
import Text.Read (readMaybe)
main = do
args <- getArgs
let dieCount = case map readMaybe args of Just n : _ -> n; _ -> 10
let f = if ["s"] == (take 1 $ drop 1 $ args) then fast else slow
print $ f dieCount
fast, slow :: Int -> P Integer
fast n = walkTree n
slow n = walkTree n `runReaderT` ()
walkTree 0 = uniform
walkTree n = liftM2 (+) (walkTree 0) (walkTree $ n - 1)
data P a = P [a] deriving Show
-- newtype P a = P [a] deriving Show
class Monad m => MonadP m where uniform :: m Integer
instance MonadP P where uniform = P [1, 1]
instance MonadP p => MonadP (ReaderT env p) where uniform = lift uniform
instance Functor P where fmap f (P pxs) = P $ fmap f pxs
instance Applicative P where
pure x = P [x]
(P pfs) <*> (P pxs) = P $ pfs <*> pxs
instance Monad P where
(P pxs) >>= f = P $ do
x <- pxs
case f x of P fxs -> fxs
How can I find out why?
This is, in general, hard.
The extreme way to do it is to look at the core code (which you can produce by running GHC with -ddump-simpl). This can get complicated really quickly, and it's basically a whole new language to learn. Your program is already big enough that I had trouble learning much from the core dump.
The other way to find out why is to just keep using GHC and asking questions and learning about GHC optimizations until you recognize certain patterns.
Why?
In short, I believe it's due to list fusion.
NOTE: I don't know for sure that this answer is correct, and it would take more time/work to verify than I'm willing to put in right now. That said, it fits the evidence.
First off, we can check whether this slowdown you're seeing is a result of something truly fundamental vs a GHC optimization triggering or not by running in O0, that is, without optimizations. In this mode, both Distribution representations result in about the same (excruciatingly long) runtime. This leads me to believe that it's not the data representation that is inherently the problem but rather there's an optimization that's triggered with the newtype version that isn't with the data version.
When GHC is run in -O1 or higher, it engages certain rewrite rules to fuse different folds and maps of lists together so that it doesn't need to allocate intermediate values. (See https://markkarpov.com/tutorial/ghc-optimization-and-fusion.html#fusion for a decent tutorial on this concept as well as https://stackoverflow.com/a/38910170/14802384 which additionally has a link to a gist with all of the rewrite rules in base.) Since computeDistribution is basically just a bunch of list manipulations (which are all essentially folds), there is the potential for these to fire.
The key is that with the newtype representation of Distribution, the newtype wrapper is erased during compilation, and the list operations are allowed to fuse. However, with the data representation, the wrappers are not erased, and the rewrite rules do not fire.
Therefore, I will make an unsubstantiated claim: If you want your data representation to be as fast as the newtype one, you will need to set up rewrite rules similar to the ones for list folding but that work over the Distribution type. This may involve writing your own special fold functions and then rewriting your Functor/Applicative/Monad instances to use them.
Coming back to Haskell four years after the first glimpse of it. I'm always as amazed by the expressiveness, and as baffled by my inability to predict space/time performance.
As a warm up, I took to translating a tiny toy program I had written in C++. It's about "cheating" at Scrabble. You input your game, and it outputs possible words you may play, with your letters alone or by crossing a letter on the board.
The whole thing revolves around a dictionary that's preloaded at start up. The words are then stored as lists in a map, together with their anagrams. The keys are strings of sorted letters. An example will speak more clearly :
Key : "AEHPS" Value : ["HEAPS","PHASE","SHAPE"]
The C++ version reads the ~320000 words of the dictionary one at a time, in about 200ms in total. The resulting data structure is a hash-map stored in a array<99991, vector<string>> and takes up about 12 megabytes of memory.
The Haskell version reads the same dictionary in about 5 seconds, and the program heap size was ballooning up to 400 megabytes ! I changed the value type in the Data.Map from [String] to [ByteString] to save some memory, and that brought the program memory consumption down to roughly 290 megabytes. That's still 24 times more than my C++ version. That's more than mere "overhead", even though Data.Map is a tree instead of an array.
So I assume that I'm doing something wrong.
The whole module is visible here : (deprecated link)
I suppose that my trouble has something to do with the way that Data.Map gets built incrementally, growing upon previous versions of itself ? Or with the data structure itself ? Or something else ?
I'll try other solutions, like Data.HashMap, or filling up Data.Map with fromListWith. Nevertheless, I'd like to build up some understanding of what's going on here. Thanks a lot for any insight !
Short answer :
Using Data.Map.Strict, forcing the evaluation of value elements, and storing the keys as ByteStrings too made the miracle of dividing the memory footprint nearly by 3. The result is 100Meg, which is only twice larger than a standard std::multimap<std::string, std::string> in C++ for the same dataset. No speedup though. Git updated.
Thanks a lot to all who contributed, there is interesting material down here !
One mistake that you are making that hasn't been pointed out yet is that you are storing unevaluated thunks of the form B.pack word in the lists that are the values in your Map. That means that you are retaining essentially the entire input file in the inefficient String format during the construction of your Map, at a cost of 24 bytes per character in your input file. Using the Data.Map.Strict API makes no difference here, since the functions in that API only force the elements of the Map to weak head normal form, which for a list means only evaluating whether the outermost constructor is [] or (:), and not evaluating any of the list's elements.
Another improvement you can make is to use the ShortByteString type available in recent versions of bytestring (the one that comes with GHC 7.8 is new enough). This is specifically designed for minimizing memory usage when storing many short bytestrings, with the trade-off being that most operations on a ShortByteString require a copy.
András Kovács's Map example code would look like this with these changes:
{-# LANGUAGE BangPatterns #-}
import Control.Applicative
import Data.List
import qualified Data.Map.Strict as M
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Short as B (ShortByteString, toShort, fromShort)
shortPack = B.toShort . B.pack
main = do
words <- lines <$> readFile "dict.txt"
print $ M.size $
M.fromListWith (++) $ map (\w -> let !x = shortPack w in (shortPack $ sort w, [x])) words
Each of these changes saves about 30% of the maximum residency in my tests, for a total of over 50% space usage savings.
EDIT: Removed erroneous benchmarking and commentary, only left a minor bit of advice. See Reid Barton's answer for a direct solution to OP's question.
If you don't need to change the dictionary at runtime, then DAWG-s are pretty much the most space-time efficient solution you can get (at least for word games).
For example, we can generate and serialize a DAWG from your dictionary that takes only 295 Kb space, and supports quite efficient lookup and prefix matching:
import qualified Data.DAWG.Packed as D -- from my "packed-dawg" package
main = do
words <- lines <$> readFile "dict.txt"
D.toFile "dict.dawg" $ D.fromList words -- serialize as "dict.dawg"
The following works in around a second on my laptop:
import qualified Data.Map as M
import qualified Data.Text as T
import qualified Data.Text.IO as T
import Data.List
type Dict = M.Map T.Text [T.Text]
newDict = M.empty
addWord:: T.Text -> Dict -> Dict
addWord word dict = M.insertWith (++) (T.pack $ sort $ T.unpack word) [word] dict
loadAnagramsFromFile fileName = do
full <- T.readFile fileName
let ls = T.lines full
return $ foldr addWord newDict lsct ls
This is using Text, with just a wart for sorting. There may be a better way of sorting a Text.
I have a fair bit of understanding of haskell but I am always little unsure about what kind of pragmas and optimizations I should use and where. Like
Like when to use SPECIALIZE pragma and what performance gains it has.
Where to use RULES. I hear people taking about a particular rule not firing? How do we check that?
When to make arguments of a function strict and when does that help? I understand that making argument strict will make the arguments to be evaluated to normal form, then why should I not add strictness to all function arguments? How do I decide?
How do I see and check I have a space leak in my program? What are the general patterns which constitute to a space leak?
How do I see if there is a problem with too much lazyness? I can always check the heap profiling but I want to know what are the general cause, examples and patterns where lazyness hurts?
Is there any source which talks about advanced optimizations (both at higher and very low levels) especially particular to haskell?
Like when to use SPECIALIZE pragma and what performance gains it has.
You let the compiler specialise a function if you have a (type class) polymorphic function, and expect it to be called often at one or a few instances of the class(es).
The specialisation removes the dictionary lookup where it is used, and often enables further optimisation, the class member functions can often be inlined then, and they are subject to strictness analysis, both give potentially huge performance gains. If the only optimisation possible is the elimination of the dicitonary lookup, the gain won't generally be huge.
As of GHC-7, it's probably more useful to give the function an {-# INLINABLE #-} pragma, which makes its (nearly unchanged, some normalising and desugaring is performed) source available in the interface file, so the function can be specialised and possibly even inlined at the call site.
Where to use RULES. I hear people taking about a particular rule not firing? How do we check that?
You can check which rules have fired by using the -ddump-rule-firings command line option. That usually dumps a large number of fired rules, so you have to search a bit for your own rules.
You use rules
when you have a more efficient version of a function for special types, e.g.
{-# RULES
"realToFrac/Float->Double" realToFrac = float2Double
#-}
when some functions can be replaced with a more efficient version for special arguments, e.g.
{-# RULES
"^2/Int" forall x. x ^ (2 :: Int) = let u = x in u*u
"^3/Int" forall x. x ^ (3 :: Int) = let u = x in u*u*u
"^4/Int" forall x. x ^ (4 :: Int) = let u = x in u*u*u*u
"^5/Int" forall x. x ^ (5 :: Int) = let u = x in u*u*u*u*u
"^2/Integer" forall x. x ^ (2 :: Integer) = let u = x in u*u
"^3/Integer" forall x. x ^ (3 :: Integer) = let u = x in u*u*u
"^4/Integer" forall x. x ^ (4 :: Integer) = let u = x in u*u*u*u
"^5/Integer" forall x. x ^ (5 :: Integer) = let u = x in u*u*u*u*u
#-}
when rewriting an expression according to general laws might produce code that's better to optimise, e.g.
{-# RULES
"map/map" forall f g. (map f) . (map g) = map (f . g)
#-}
Extensive use of RULES in the latter style is made in fusion frameworks, for example in the text library, and for the list functions in base, a different kind of fusion (foldr/build fusion) is implemented using rules.
When to make arguments of a function strict and when does that help? I understand that making argument strict will make the arguments to be evaluated to normal form, then why should I not add strictness to all function arguments? How do I decide?
Making an argument strict will ensure that it is evaluated to weak head normal form, not to normal form.
You do not make all arguments strict because some functions must be non-strict in some of their arguments to work at all and some are less efficient if strict in all arguments.
For example partition must be non-strict in its second argument to work at all on infinite lists, more general every function used in foldr must be non-strict in the second argument to work on infinite lists. On finite lists, having the function non-strict in the second argument can make it dramatically more efficient (foldr (&&) True (False:replicate (10^9) True)).
You make an argument strict, if you know that the argument must be evaluated before any worthwhile work can be done anyway. In many cases, the strictness analyser of GHC can do that on its own, but of course not in all.
A very typical case are accumulators in loops or tail recursions, where adding strictness prevents the building of huge thunks on the way.
I know no hard-and-fast rules for where to add strictness, for me it's a matter of experience, after a while you learn in what places adding strictness is likely to help and where to harm.
As a rule of thumb, it makes sense to keep small data (like Int) evaluated, but there are exceptions.
How do I see and check I have a space leak in my program? What are the general patterns which constitute to a space leak?
The first step is to use the +RTS -s option (if the programme was linked with rtsopts enabled). That shows you how much memory was used overall, and you can often judge by that whether you have a leak.
A more informative output can be obtained from running the programme with the +RTS -hT option, that produces a heap profile that can help locating the space leak (also, the programme needs to be linked with enabled rtsopts).
If further analysis is required, the programme needs to be compiled with profiling enabled (-rtsops -prof -fprof-auto, in older GHCs, the -fprof-auto option wasn't available, the -prof-auto-all option is the closest correspondence there).
Then you run it with various profiling options and look at the generated heap profiles.
The two most common causes for space leaks are
too much laziness
too much strictness
the third place is probably taken by unwanted sharing, GHC does little common subexpression elimination, but it occasionally shares long lists even where not wanted.
For finding the cause of a leak, I know again no hard-and-fast rules, and occasionally, a leak can be fixed by adding strictness in one place or by adding laziness in another.
How do I see if there is a problem with too much lazyness? I can always check the heap profiling but I want to know what are the general cause, examples and patterns where lazyness hurts?
Generally, laziness is wanted where results can be built up incrementally, and unwanted where no part of the result can be delivered before processing is complete, like in left folds or generally in tail-recursive functions.
I recommend reading the GHC documentation on Pragmas and Rewrite Rules, as they address many of your questions about SPECIALIZE and RULES.
To briefly address your questions:
SPECIALIZE is used to force the compiler to build a specialized version of a polymorphic function for a particular type. The advantage is that applying the function in that case will no longer require the dictionary. The disadvantage is that it will increase the size of your program. Specialization is particularly valuable for functions called in "inner-loops", and it's essentially useless for infrequently called top-level functions. Refer to the GHC documentation for interactions with INLINE.
RULES allows you to specify rewrite rules that you know to be valid but the compiler couldn't infer on its own. The common example is {-# RULES "mapfusion" forall f g xs. map f (map g xs) = map (f.g) xs #-}, which tells GHC how to fuse map. It can be finicky to get GHC to use the rules because of interference with INLINE. 7.19.3 touches on how to avoid conflicts and also how to force GHC to use a rule even when it would normally avoid it.
Strict arguments are most vital for something like an accumulator in a tail-recursive function. You know that the value will ultimately be fully calculated, and building up a stack of closures to delay the computation completely defeats the purpose. Enforced strictness must naturally be avoided anytime the function may be applied to a value which must be processed lazily, like an infinite list. Generally, the best idea is to initially only force strictness where it's obviously useful (like accumulators), and then add more later only as profiling shows it's needed.
My experience has been that most show-stopping space leaks came from lazy accumulators and unevaluated lazy values in very large data-structures, although I'm sure this is specific to the kinds of programs you're writing. Using unboxed data-structures whenever possible fixes a lot of the problems.
Outside of the instances where laziness causes space-leaks, the major situation where it should be avoided is in IO. Lazily processing resource inherently increases the amount of wall-clock time that the resource is needed. This can be bad for cache performance, and it's obviously bad if something else wants exclusive rights to use the same resource.
In one of my first attempts to create functional code, I ran into a performance issue.
I started with a common task - multiply the elements of two arrays and sum up the results:
var first:Array[Float] ...
var second:Array[Float] ...
var sum=0f;
for (ix<-0 until first.length)
sum += first(ix) * second(ix);
Here is how I reformed the work:
sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
When I benchmarked the two approaches, the second method takes 40 times as long to complete!
Why does the second method take so much longer? How can I reform the work to be both speed efficient and use functional programming style?
The main reasons why these two examples are so different in speed are:
the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
the faster one doesn't create temporary collections and, thus, avoids extra memory copies.
Let's consider the slower one by parts. First:
first.zip(second)
That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.
map{ case (a,b) => a*b }
Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).
We are not finished, though. Here's the next part:
reduceLeft(_+_)
That one is relatively harmless, except that it still do boxing/unboxing and java.lang.Float creation at iteration, since reduceLeft receives a Function2, which is parameterized.
Scala 2.8 introduces a feature called specialization which will get rid of a lot of these boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:
sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:
sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
This last one doesn't save much, actually, so I think the non-strictness if being "lost" pretty fast (ie, one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (ie, avoids some intermediary results) by default:
sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
This gives much better result that the former. Better than the foldLeft one, though not by much. Unfortunately, we can't combined zipped with foldLeft because the former doesn't support the latter.
The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. On my tests, though Double is actually taking longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.
So, in the end, the problem is a combination of factors, including producing intermediary copies of elements, and the way Java (JVM) handles primitives and generics. A similar code in Haskell using supercompilation would be equal to anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.
I did some variations of this with Scala 2.8. The loop version is as you write but the
functional version is slightly different:
(xs, ys).zipped map (_ * _) reduceLeft(_ + _)
I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested Boxed variants which work on java.lang.Double's instead of primitive Doubles to measure
the effect of primitive type boxing and unboxing. Here is what I got (running Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test).
loopArray 461 437 436 437 435
reduceArray 6573 6544 6718 6828 6554
loopVector 5877 5773 5775 5791 5657
reduceVector 5064 4880 4844 4828 4926
loopArrayBoxed 2627 2551 2569 2537 2546
reduceArrayBoxed 4809 4434 4496 4434 4365
loopVectorBoxed 7577 7450 7456 7463 7432
reduceVectorBoxed 5116 4903 5006 4957 5122
The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.
If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.
If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.
So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.
This is a microbenchmark, and it depends on how the compiler optimizes you code. You have 3 loops composed here,
zip . map . fold
Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.
Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.
There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.
So, in summary:
naive translations may have unexpected copies and inefficiences
clever FP compilers remove this overhead (but Scala can't yet)
sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it
Don Stewart has a fine answer, but it might not be obvious how going from one loop to three creates a factor of 40 slowdown. I'll add to his answer that Scala compiles to JVM bytecodes, and not only does the Scala compiler not fuse the three loops into one, but the Scala compiler is almost certainly allocating all the intermediate arrays. Notoriously, implementations of the JVM are not designed to handle the allocation rates required by functional languages. Allocation is a significant cost in functional programs, and that's one the loop-fusion transformations that Don Stewart and his colleagues have implemented for Haskell are so powerful: they eliminate lots of allocations. When you don't have those transformations, plus you're using an expensive allocator such as is found on a typical JVM, that's where the big slowdown comes from.
Scala is a great vehicle for experimenting with the expressive power of an unusual mix of language ideas: classes, mixins, modules, functions, and so on. But it's a relatively young research language, and it runs on the JVM, so it's unreasonable to expect great performance except on the kind of code that JVMs are good at. If you want to experiment with the mix of language ideas that Scala offers, great—it's a really interesting design—but don't expect the same performance on pure functional code that you'd get with a mature compiler for a functional language, like GHC or MLton.
Is scala functional programming slower than traditional coding?
Not necessarily. Stuff to do with first-class functions, pattern matching, and currying need not be especially slow. But with Scala, more than with other implementations of other functional languages, you really have to watch out for allocations—they can be very expensive.
The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.
That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:
class FastFloatOps(a: Array[Float]) {
def fastMapOnto(f: Float => Float) = {
var i = 0
while (i < a.length) { a(i) = f(a(i)); i += 1 }
this
}
def fastMapWith(b: Array[Float])(f: (Float,Float) => Float) = {
val len = a.length min b.length
val c = new Array[Float](len)
var i = 0
while (i < len) { c(i) = f(a(i),b(i)); i += 1 }
c
}
def fastReduce(f: (Float,Float) => Float) = {
if (a.length==0) Float.NaN
else {
var r = a(0)
var i = 1
while (i < a.length) { r = f(r,a(i)); i += 1 }
r
}
}
}
implicit def farray2fastfarray(a: Array[Float]) = new FastFloatOps(a)
and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)
Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.
I am not an expert Scala programmer, so there is probably a more efficient method, but what about something like this. This can be tail call optimized, so performance should be OK.
def multiply_and_sum(l1:List[Int], l2:List[Int], sum:Int):Int = {
if (l1 != Nil && l2 != Nil) {
multiply_and_sum(l1.tail, l2.tail, sum + (l1.head * l2.head))
}
else {
sum
}
}
val first = Array(1,2,3,4,5)
val second = Array(6,7,8,9,10)
multiply_and_sum(first.toList, second.toList, 0) //Returns: 130
To answer the question in the title: Simple functional constructs may be slower than imperative on the JVM.
But, if we consider only simple constructs, then we might as well throw out all modern languages and stick with C or assembler. If you look a the programming language shootout, C always wins.
So why choose a modern language? Because it lets you express a cleaner design. Cleaner design leads to performance gains in the overall operation of the application. Even if some low-level methods can be slower. One of my favorite examples is the performance of BuildR vs. Maven. BuildR is written in Ruby, an interpreted, slow, language. Maven is written in Java. A build in BuildR is twice as fast as Maven. This is due mostly to the design of BuildR which is lightweight compared with that of Maven.
Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforesting and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:
let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) -> Array.reduce ( * )
may be rewritten using fold2 so as to avoid all temporary data structures:
let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys
This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.
Here is dbyrnes solution with arrays (assuming Arrays are to be used) and just iterating over the index:
def multiplyAndSum (l1: Array[Int], l2: Array[Int]) : Int =
{
def productSum (idx: Int, sum: Int) : Int =
if (idx < l1.length)
productSum (idx + 1, sum + (l1(idx) * l2(idx))) else
sum
if (l2.length == l1.length)
productSum (0, 0) else
error ("lengths don't fit " + l1.length + " != " + l2.length)
}
val first = (1 to 500).map (_ * 1.1) toArray
val second = (11 to 510).map (_ * 1.2) toArray
def loopi (n: Int) = (1 to n).foreach (dummy => multiplyAndSum (first, second))
println (timed (loopi (100*1000)))
That needs about 1/40 of the time of the list-approach. I don't have 2.8 installed, so you have to test #tailrec yourself. :)