I'm playing around with writing a JPEG decoder in pure Haskell (a chance to learn repa for me!), and right now I'm working on the IDCT (which amounts to multiplying a bunch of 8×8 matrices by a fixed 8×8 matrix). The function idctBlocks at the end of the post takes a vector of Word16s as well as blockCount (the count of 8×8 blocks) and compsCount (the number of components, pardon my terminology, in each block, for an image encoded in YCbCr it would be 3) and produces some value based on the results of the IDCTs. The total size of the vector is then blockCount * compsCount * 8 * 8 elements.
Now, I also have a (quite big) test image with about 320k such blocks, each having 3 components, giving about a million 8×8 matrices total. How long would it take to perform the IDCT on all of them without doing anything too smart? My back-of-the-napkin calculation says about 30-250 ms of single-core time:
For each matrix, to compute a single element of the result, we need to do 8 multiplications and 7 additions. For a really pessimistic bound, let's say we don't do any SIMD stuff. The reciprocal throughput of all the relevant instructions is 1 on my CPU, so those 15 operations take about 15 cycles (well, there's also instruction latency, but that shouldn't be a factor, especially since we're doing this lots of times).
To compute each resulting matrix, we need to compute 64 resulting elements, so that's 64×15, or about 1 thousand cycles per resulting matrix.
And we need to compute about 1 million matrices, giving about 1 billion cycles.
My CPU runs at 4 GHz (and the task is super cache-friendly, so it shouldn't be limited by memory), so it should take about a quarter of a second, hence the worst-case estimate of 250 ms.
Now, if the compiler is smart enough, it surely will use SIMD instructions for the vector multiplication and for the horizontal sum, reducing 15 cycles per element to about two (well, there's also the loading and packing part, but the IDCT matrix can wholly reside in eight of the 16 XMM registers my CPU has, so there's no need to load that, and there's plenty of room for instruction-level parallelism, so we can probably disregard that for the purposes of this estimate). So, that's a speed-up of about 8 times, giving the optimistic bound of 250 / 8 ≈ 30 ms.
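For reference, the unit of work being counted above is just an 8-element dot product per output element; as a sketch (dot8 is a made-up name, not part of the module below):

import qualified Data.Vector.Unboxed as V

-- One output element: 8 multiplications plus a 7-addition horizontal sum.
dot8 :: V.Vector Float -> V.Vector Float -> Float
dot8 row col = V.sum (V.zipWith (*) row col)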
Given all of the above, I was somewhat dumbfounded to see the function below take 2 seconds. What's more interesting, it isn't sensitive at all to whether the elements are traversed in row order or column order: replacing
arrSlice = R.unsafeSlice arr (sh :. x :. All)
idctSlice = R.unsafeSlice idctMat (Any :. y :. All)
with
arrSlice = R.unsafeSlice arr (sh :. All :. x)
idctSlice = R.unsafeSlice idctMat (Any :. All :. y)
just for the sake of seeing how the run time changes shows that it does not change at all (and I think this alone shows that something is very wrong).
Oh, and if I remove the final R.sumAllS (which I had in my pre-repa times to ensure everything gets evaluated) and replace it with, say, idct blockMats `R.deepSeqArray` 0, the run time does not change in any significant way (well, it drops by about 5%, but that's insignificant).
So, finally, here's the module:
{-# LANGUAGE OverloadedLists, TypeOperators, FlexibleContexts, TypeFamilies #-}
{-# LANGUAGE Strict #-}
{-# OPTIONS -optlo-O3 #-}
module Data.Image.Jpeg.IDCT(idctBlocks, zigzag) where
import qualified Data.Array.Repa as R
import qualified Data.Array.Repa.Eval as R
import qualified Data.Array.Repa.Unsafe as R
import qualified Data.Vector.Unboxed as V
import Control.Monad.Identity
import Data.Array.Repa (Array, All(..), Any(..), U, Z(..), (:.)(..))
import GHC.Word
zigzag :: V.Vector Int
zigzag =
[
0, 1, 5, 6, 14, 15, 27, 28,
2, 4, 7, 13, 16, 26, 29, 42,
3, 8, 12, 17, 25, 30, 41, 43,
9, 11, 18, 24, 31, 40, 44, 53,
10, 19, 23, 32, 39, 45, 52, 54,
20, 22, 33, 38, 46, 51, 55, 60,
21, 34, 37, 47, 50, 56, 59, 61,
35, 36, 48, 49, 57, 58, 62, 63
]
computeP :: (R.Load r sh a, V.Unbox a) => Array r sh a -> Array U sh a
computeP = runIdentity . R.computeP
{-# INLINE computeP #-}
unZigzagify :: (Z :. Int :. Int :. Int) -> (Z :. Int :. Int :. Int)
unZigzagify (_ :. b :. c :. p) = Z :. b :. c :. (zigzag `V.unsafeIndex` p)
{-# INLINE unZigzagify #-}
idctMat :: Array U R.DIM2 Float
idctMat = R.computeUnboxedS $ R.transpose $ R.fromFunction (Z :. 8 :. 8) (\(_ :. r :. c) -> point r (c + 1))
  where
    point 0 _ = sqrt $ 1 / 8
    point r c = sqrt (2 / 8) * cos (pi * (2 * c' - 1) * r' / (2 * 8))
      where
        r' = fromIntegral r
        c' = fromIntegral c
idct :: R.Source r Float
     => Array r R.DIM4 Float
     -> Array U R.DIM4 Float
idct arr = computeP $ R.fromFunction dims f
  where
    dims = R.extent arr
    f (sh :. x :. y) = R.sumAllS (R.zipWith (*) arrSlice idctSlice)
      where
        arrSlice = R.unsafeSlice arr (sh :. x :. All)
        idctSlice = R.unsafeSlice idctMat (Any :. y :. All)
{-# INLINE idct #-}
idctBlocks :: Int -> Int -> V.Vector Word16 -> Float
idctBlocks blocksCount compsCount blocks = R.sumAllS $ idct blockMats
  where
    flatExtent = Z :. blocksCount :. compsCount :. 64 :: R.DIM3
    matExtent = Z :. blocksCount :. compsCount :. 8 :. 8 :: R.DIM4
    reparr = R.fromUnboxed flatExtent blocks
    blockMats = R.map fromIntegral $ R.reshape matExtent $ R.computeUnboxedS $ R.backpermute flatExtent unZigzagify reparr
{-# NOINLINE idctBlocks #-}
I have -O2 -fllvm for my whole project, so those options apply to this file as well. And, of course, I might have mixed up the IDCT matrix with its transpose, but that shouldn't affect the performance.
So, what could I do differently, and how could this be improved?
By the way, using the typical repa-recommended matrix multiplication of the form
matMul :: (R.Source r1 Float, R.Source r2 Float, R.Shape sh)
=> Array r1 (sh :. Int :. Int) Float
-> Array r2 (sh :. Int :. Int) Float
-> Array U (sh :. Int :. Int) Float
matMul a b = runIdentity $ R.sumP (R.zipWith (*) a' b')
where
a' = R.extend (Any :. All :. (8 :: Int) :. All) a
b' = R.extend (Any :. (8 :: Int) :. All :. All) b
{-# INLINE matMul #-}
along with a properly R.extended idctMat doesn't help much; in fact, performance degrades even further, to closer to 30 seconds.
Related
Inputs:
1) I = Tensor of dim (N, C, X) (Input)
2) W = Tensor of dim (N, X, Y) (Weight)
Output:
1) O = Tensor of dim (N, C, Y) (Output)
I want to compute:
I = I.view(N, C, X, 1)
W = W.view(N, 1, X, Y)
PROD = I*W
O = PROD.sum(dim=2)
return O
without incurring N * C * X * Y memory overhead.
Basically I want to calculate the weighted sum of a feature map wherein the weights are the same along the channel dimension, without incurring memory overhead per channel.
Maybe I could use
from itertools import product
O = torch.zeros(N, C, Y)
for n, x, y in product(range(N), range(X), range(Y)):
    O[n, :, y] += I[n, :, x] * W[n, x, y]
return O
but that would be slower (no broadcasting) and I'm not sure how much memory overhead would be incurred by saving variables for the backward pass.
You can use torch.bmm (https://pytorch.org/docs/stable/torch.html#torch.bmm). Just do torch.bmm(I, W).
To verify the results:
import torch

N, C, X, Y = 100, 10, 9, 8
i = torch.rand(N, C, X)
w = torch.rand(N, X, Y)
o = torch.bmm(i, w)

# desired result code
I = i.view(N, C, X, 1)
W = w.view(N, 1, X, Y)
PROD = I * W
O = PROD.sum(dim=2)
print(torch.allclose(O, o))  # should print True if the outputs match
EDIT: Ideally, I would assume PyTorch's internal matrix multiplication is efficient. However, you can also measure the memory usage with tracemalloc (at least on CPU). See https://discuss.pytorch.org/t/measuring-peak-memory-usage-tracemalloc-for-pytorch/34067 for GPU.
import torch
import tracemalloc

tracemalloc.start()
N, C, X, Y = 100, 10, 9, 8
i = torch.rand(N, C, X)
w = torch.rand(N, X, Y)
o = torch.bmm(i, w)

# output is a tuple indicating current memory and peak memory
print(tracemalloc.get_traced_memory())
You can do the same with other code and see that the bmm implementation is indeed efficient.
import torch
import tracemalloc

tracemalloc.start()
N, C, X, Y = 100, 10, 9, 8
i = torch.rand(N, C, X)
w = torch.rand(N, X, Y)
I = i.view(N, C, X, 1)
W = w.view(N, 1, X, Y)
PROD = I * W
O = PROD.sum(dim=2)

# output is a tuple indicating current memory and peak memory
print(tracemalloc.get_traced_memory())
I have to sort the rows of large integer matrices in Haskell, and I started benchmarking with random data. I found that Haskell is 3 times slower than C++.
Because of the randomness, I expect row comparison to always terminate at the first column (which should have no duplicates). So I narrowed the matrix to a single column, implemented as a Vector (Unboxed.Vector Int), and compared its sorting to that of a usual Vector Int.
Vector Int sorts as fast as C++ (good news!), but again, the column matrix is 3 times slower. Do you have an idea why? Please find the code below.
import qualified Data.Vector.Unboxed as UV(Vector, fromList)
import qualified Data.Vector as V(Vector, fromList, modify)
import Criterion.Main(env, bench, nf, defaultMain)
import System.Random(randomIO)
import qualified Data.Vector.Algorithms.Intro as Alg(sort)
randomVector :: Int -> IO (V.Vector Int)
randomVector count = V.fromList <$> mapM (\_ -> randomIO) [1..count]

randomVVector :: Int -> IO (V.Vector (UV.Vector Int))
randomVVector count = V.fromList <$> mapM (\_ -> do
  x <- randomIO
  return $ UV.fromList [x]) [1..count]

benchSort :: IO ()
benchSort = do
  let bVVect = env (randomVVector 300000) $ bench "sortVVector" . nf (V.modify Alg.sort)
      bVect = env (randomVector 300000) $ bench "sortVector" . nf (V.modify Alg.sort)
  defaultMain [bVect, bVVect]

main = benchSort
As Edward Kmett has explained to me, the Haskell version has one extra layer of indirection. A UV.Vector looks something like
data Vector a = Vector !Int !Int ByteArray#
So each entry in your vector of vectors is actually a pointer to a record holding slice indices and a pointer to an array of bytes. This is an extra indirection that the C++ code doesn't have. The solution is to use an ArrayArray#, which is an array of direct pointers to byte arrays or to further ArrayArray#s. If you need vector, you'll have to figure out what to do about the slicing machinery. Another option is to switch to primitive, which offers simpler arrays.
Following dfeuer's advice, implementing a vector of vectors as an ArrayArray# is 4 times faster than Vector (Unboxed.Vector Int) and only 40% slower than sorting a C++ std::vector<std::vector<int>>:
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import Control.Monad.Primitive
import Data.Primitive.ByteArray
import qualified Data.Vector.Generic.Mutable.Base as GM(MVector(..))
import GHC.Exts (Int(..))
import GHC.Prim

data MutableArrayArray s a = MutableArrayArray (MutableArrayArray# s)

instance GM.MVector MutableArrayArray ByteArray where
  {-# INLINE basicLength #-}
  basicLength (MutableArrayArray marr) = I# (sizeofMutableArrayArray# marr)
  {-# INLINE basicUnsafeRead #-}
  basicUnsafeRead (MutableArrayArray marr) (I# i) = primitive $ \s ->
    case readByteArrayArray# marr i s of
      (# s1, bar #) -> (# s1, ByteArray bar #)
  {-# INLINE basicUnsafeWrite #-}
  basicUnsafeWrite (MutableArrayArray marr) (I# i) (ByteArray bar) = primitive $ \s ->
    (# writeByteArrayArray# marr i bar s, () #)
(The remaining methods of the MVector class, such as basicUnsafeSlice, basicOverlaps, and basicUnsafeNew, need analogous definitions.) For example, sorting a matrix of integers will then use
sortIntArrays :: ByteArray -> ByteArray -> Ordering
sortIntArrays x y = let h1 = indexByteArray x 0 :: Int
                        h2 = indexByteArray y 0 :: Int
                    in compare h1 h2
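With the instance above in scope (and the remaining methods filled in), the comparison can be handed straight to sortBy from Data.Vector.Algorithms.Intro; a minimal sketch (sortRows is a made-up name):

import qualified Data.Vector.Algorithms.Intro as Alg

-- Sort the rows in place, comparing each row by its first Int.
sortRows :: PrimMonad m => MutableArrayArray (PrimState m) ByteArray -> m ()
sortRows = Alg.sortBy sortIntArrays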
Please note I am almost a complete newbie in OCaml. In order to learn a bit, and test its performance, I tried to implement a module that approximates Pi using the Leibniz series.
My first attempt led to a stack overflow (the actual error, not this site). Knowing from Haskell that this may come from too many "thunks", or promises to compute something, while recursing over the addends, I looked for some way of keeping just the last result while summing with the next. I found the following tail-recursive implementations of sum and map in the notes of an OCaml course, here and here, and expected the compiler to produce an efficient result.
However, the resulting executable, compiled with ocamlopt, is much slower than a C++ version compiled with clang++. Is this code as efficient as possible? Is there some optimization flag I am missing?
My complete code is:
let (--) i j =
  let rec aux n acc =
    if n < i then acc else aux (n-1) (n :: acc)
  in aux j [];;

let sum_list_tr l =
  let rec helper a l = match l with
    | [] -> a
    | h :: t -> helper (a +. h) t
  in helper 0. l

let rec tailmap f l a = match l with
  | [] -> a
  | h :: t -> tailmap f t (f h :: a);;

let rev l =
  let rec helper l a = match l with
    | [] -> a
    | h :: t -> helper t (h :: a)
  in helper l [];;

let efficient_map f l = rev (tailmap f l []);;

let summand n =
  let m = float_of_int n
  in (-1.) ** m /. (2. *. m +. 1.);;

let pi_approx n =
  4. *. sum_list_tr (efficient_map summand (0 -- n));;
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (pi_approx n);;
Just for reference, here are the measured times on my machine:
❯❯❯ time ocaml/main 10000000
3.14159275359
ocaml/main 10000000 3,33s user 0,30s system 99% cpu 3,625 total
❯❯❯ time cpp/main 10000000
3.14159
cpp/main 10000000 0,17s user 0,00s system 99% cpu 0,174 total
For completeness, let me state that the first helper function, an equivalent to Python's range, comes from this SO thread, and that this is run using OCaml version 4.01.0, installed via MacPorts on a Darwin 13.1.0.
As I noted in a comment, OCaml's floats are boxed, which puts OCaml at a disadvantage compared to Clang.
However, I may be noticing another rough edge typical of trying OCaml after Haskell:
if I see what your program is doing correctly, you are creating a list of stuff, then mapping a function over that list, and finally folding it into a result.
In Haskell, you could more or less expect such a program to be automatically “deforested” at compile-time, so that the resulting generated code was an efficient implementation of the task at hand.
In OCaml, the fact that functions can have side effects, and in particular functions passed to higher-order functions such as map and fold, means that it would be much harder for the compiler to deforest automatically. The programmer has to do it by hand.
In other words: stop building huge short-lived data structures such as 0 -- n and (efficient_map summand (0 -- n)). When your program decides to tackle a new summand, make it do all it wants to do with that summand in a single pass. You can see this as an exercise in applying the principles in Wadler's article (again, by hand, because for various reasons the compiler will not do it for you despite your program being pure).
Here are some results:
$ ocamlopt v2.ml
$ time ./a.out 1000000
3.14159165359
real 0m0.020s
user 0m0.013s
sys 0m0.003s
$ ocamlopt v1.ml
$ time ./a.out 1000000
3.14159365359
real 0m0.238s
user 0m0.204s
sys 0m0.029s
v1.ml is your version. v2.ml is what you might consider an idiomatic OCaml version:
let rec q_pi_approx p n acc =
  if n = p
  then acc
  else q_pi_approx (succ p) n (acc +. (summand p))
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx 0 n 0.));;
(reusing summand from your code)
It might be more accurate to sum from the last terms to the first, instead of from the first to the last. This is orthogonal to your question, but you may consider it as an exercise in modifying a function that has been forcefully made tail-recursive. Besides, the (-1.) ** m expression in summand is mapped by the compiler to a call to the pow() function on the host, and that's a bag of hurt you may want to avoid.
I've also tried several variants, here are my conclusions:
Using arrays
Using recursion
Using imperative loop
The recursive function is about 30% faster than the array implementation. The imperative loop is about as fast as the recursion (maybe even a little slower).
Here're my implementations:
Array:
open Core.Std
let pi_approx n =
  let f m = (-1.) ** m /. (2. *. m +. 1.) in
  let qpi = Array.init n ~f:Float.of_int |>
            Array.map ~f |>
            Array.reduce_exn ~f:(+.) in
  qpi *. 4.0
Recursion:
let pi_approx n =
  let rec loop n acc m =
    if m = n
    then acc *. 4.0
    else
      let acc = acc +. (-1.) ** m /. (2. *. m +. 1.) in
      loop n acc (m +. 1.0) in
  let n = float_of_int n in
  loop n 0.0 0.0
This can be further optimized by moving the local function loop outside, so that the compiler can inline it.
Imperative loop:
let pi_approx n =
  let sum = ref 0. in
  for m = 0 to n - 1 do
    let m = float_of_int m in
    sum := !sum +. (-1.) ** m /. (2. *. m +. 1.)
  done;
  4.0 *. !sum
But in the code above, creating a ref for the sum will incur boxing/unboxing on each step, so we can further optimize this code by using the float_ref trick:
type float_ref = { mutable value : float }

let pi_approx n =
  let sum = { value = 0. } in
  for m = 0 to n - 1 do
    let m = float_of_int m in
    sum.value <- sum.value +. (-1.) ** m /. (2. *. m +. 1.)
  done;
  4.0 *. sum.value
Scoreboard
for-loop (with float_ref) : 1.0
non-local recursion : 0.89
local recursion : 0.86
Pascal's version : 0.77
for-loop (with float ref) : 0.62
array : 0.47
original : 0.08
Update
I've updated the answer, as I've found a way to get a 40% speedup (or 33% in comparison with @Pascal's answer).
I would like to add that although floats are boxed in OCaml, float arrays are unboxed. Here is a program that builds a float array corresponding to the Leibniz sequence and uses it to approximate π:
open Array
let q_pi_approx n =
  let summand n =
    let m = float_of_int n
    in (-1.) ** m /. (2. *. m +. 1.) in
  let a = Array.init n summand in
  Array.fold_left (+.) 0. a
let n = int_of_string Sys.argv.(1);;
Printf.printf "%F\n" (4. *. (q_pi_approx n));;
Obviously, it is still slower than code that doesn't build any data structure at all. Execution times (the version with the array is the last one):
time ./v1 10000000
3.14159275359
real 0m2.479s
user 0m2.380s
sys 0m0.104s
time ./v2 10000000
3.14159255359
real 0m0.402s
user 0m0.400s
sys 0m0.000s
time ./a 10000000
3.14159255359
real 0m0.453s
user 0m0.432s
sys 0m0.020s
Given this definition and a test matrix:
data (Eq a, Show a) => QT a = C a | Q (QT a) (QT a) (QT a) (QT a)
  deriving (Eq, Show)

data (Eq a, Num a, Show a) => Mat a = Mat { nexp :: Int, mat :: QT a }
  deriving (Eq, Show)
-- test matrix, exponent is 2, that is matrix is 4 x 4
test = Mat 2 (Q (C 5) (C 6) (Q (C 1) (C 0) (C 2) (C 1)) (C 3))
|     |     |
|  5  |  6  |
|     |     |
-------------
|1 | 0|     |
|--|--|  3  |
|2 | 1|     |
I'm trying to write a function that will output a list of column sums, like [13, 11, 18, 18]. The basic idea is to sum each sub-quadtree:
If the quadtree is (C c), then output the value c * 2 ^ (n - 1) repeated 2 ^ (n - 1) times. Example: the first quadtree is (C 5), so we repeat 5 * 2^(2 - 1) = 10, 2 ^ (n - 1) = 2 times, obtaining [10, 10].
Otherwise, given (Q a b c d), we zipWith the colsum of a and c (and b and d).
Of course this is not working (not even compiling) because after some recursion we have:
zipWith (+) [[10, 10], [12, 12]] [zipWith (+) [[1], [0]] [[2], [1]], [6, 6]]
Because I'm a beginner with Haskell, I feel I'm missing something and need some advice on which functions I could use. The non-working colsum definition is:
colsum :: (Eq a, Show a, Num a) => Mat a -> [a]
colsum m = csum (mat m)
  where
    n = nexp m
    csum (C c) = take (2 ^ n) $ repeat (c * 2 ^ n)
    csum (Q a b c d) = zipWith (+) [colsum $ submat a, colsum $ submat b]
                                   [colsum $ submat c, colsum $ submat d]
    submat q = Mat (n - 1) q
Any ideas would be great and much appreciated...
Probably "someone" should have explained, to whoever is worried about the depth of the QuadTree, that the nexp field in the Mat type is exactly meant to be used to determine the real size of a (C _).
About the solution presented in the first answer: OK, it works. However, it is quite wasteful to construct and deconstruct Mat; this could easily be avoided. Moreover, the call to fromIntegral to "bypass" the type-checking problem coming from the use of replicate can be solved without forcing a round trip through Integral, like
let m = 2^n; k = 2^n in replicate k (m*x)
Anyway, the real challenge here is to avoid the quadratic behavior due to the ++; that is what I would expect.
Cheers,
Let's consider your colsum:
colsum :: (Eq a, Show a, Num a) => Mat a -> [a]
colsum m = csum (mat m)
  where
    n = nexp m
    csum (C c) = take (2 ^ n) $ repeat (c * 2 ^ n)
    csum (Q a b c d) = zipWith (+) [colsum $ submat a, colsum $ submat b]
                                   [colsum $ submat c, colsum $ submat d]
    submat q = Mat (n - 1) q
It is almost correct, except for the line where you define csum (Q a b c d) = ....
Let's think about types. colsum returns a list of numbers. zipWith (+) sums two lists elementwise:
ghci> :t zipWith (+)
zipWith (+) :: Num a => [a] -> [a] -> [a]
This means that you need to pass two lists of numbers to zipWith (+). Instead you create two lists of lists of numbers, like this:
[colsum $ submat a, colsum $ submat b]
The type of this expression is [[a]], not [a] as you need.
What you need to do is to concatenate two lists of numbers to obtain a single list of numbers (and this is, probably, what you intended to do):
((colsum $ submat a) ++ (colsum $ submat b))
Similarly, you concatenate lists of partial sums for c and d then your function should start working.
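With both concatenations in place, the offending branch reads:

csum (Q a b c d) = zipWith (+) ((colsum $ submat a) ++ (colsum $ submat b))
                               ((colsum $ submat c) ++ (colsum $ submat d))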
Let's go more general, and come back to the goal at hand.
Consider how we would project a quadtree onto a 2^n × 2^n matrix. We may not need to create this projection in order to calculate its column sums, but it's a useful notion to work with.
If our quadtree is a single cell, then we'd just fill the entire matrix with that cell's value.
Otherwise, if n ≥ 1, we can divide the matrix up into quadrants, and let the subquadtrees each fill one quadrant (that is, have each subquadtree fill a 2^(n-1) × 2^(n-1) matrix).
Note that there's still a case remaining. What if n = 0 (that is, we have a 1×1 matrix) and the quadtree isn't a single cell? We need to specify some behaviour for this case - maybe we just let one of the subquadtrees populate the entire matrix, or we fill the matrix with some default value.
Now consider the column sums of such a projection.
If our quadtree was a single cell, then the 2^n column sums will all be 2^n times the value stored in that cell.
(hint: look at replicate and genericReplicate on hoogle).
Otherwise, if n ≥ 1, then each column overlaps two distinct quadrants.
Half of our columns will be completely determined by the western quadrants,
and the other half by the eastern quadrants, The sum for a particular column
can be defined as the sum of the contribution to that column
from its northern half (that is, the column sum for that column in the northern quadrant),
and its southern half (likewise).
(hint: We'll need to append the western column sums to the eastern column sums
to get all the column sums, and combine the northern and southern demi-column sums
to get the actual sums for each column).
Again, we have a third case, and the column sum here depends on how
you project four subquadtrees onto a 1×1 matrix. Fortunately, a 1×1 matrix means
only a single column sum!
Now, we only care about a particular projection - the projection onto a matrix of size 2^d × 2^d,
where d is the depth of our quadtree. So you'll need to figure out the depth too. Since a
single cell fits "naturally" into a matrix of size 1×1, that implies that it has a
depth of 0. A quadbranch must have depth great enough to allow each of its subquads to fit
into their quadrant of the matrix.
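To make that recipe concrete, here is one possible sketch along these lines (my code, not from the answer; it ignores the degenerate case of a branch at depth 0, which the text leaves unspecified):

colsum :: (Eq a, Show a, Num a) => Mat a -> [a]
colsum (Mat n t) = go n t
  where
    -- A cell filling a 2^d × 2^d quadrant: 2^d columns, each summing to x * 2^d.
    go d (C x) = replicate (2 ^ d) (x * 2 ^ d)
    -- Column sums of the northern half (nw then ne), added columnwise
    -- to those of the southern half (sw then se).
    go d (Q nw ne sw se) = zipWith (+) (go (d - 1) nw ++ go (d - 1) ne)
                                       (go (d - 1) sw ++ go (d - 1) se)

On the test matrix above this yields [13, 11, 18, 18].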
I'm a beginner to functional languages, and I'm trying to get the whole thing down in Haskell. Here's a quick-and-dirty function that finds all the factors of a number:
factors :: (Integral a) => a -> [a]
factors x = filter (\z -> x `mod` z == 0) [2..x `div` 2]
Works fine, but I found it to be unbearably slow for large numbers. So I made myself a better one:
import Data.List (sort)

factorcalc :: (Integral a) => a -> a -> [a] -> [a]
factorcalc x y z
  | y `elem` z = sort z
  | x `mod` y == 0 = factorcalc x (y+1) (z ++ [y] ++ [(x `div` y)])
  | otherwise = factorcalc x (y+1) z
But here's my problem: Even though the code works, and can cut literally hours off the execution time of my programs, it's hideous!
It reeks of ugly imperative thinking: It constantly updates a counter and a data structure in a loop until it finishes. Since you can't change state in purely functional programming, I cheated by holding the data in the parameters, which the function simply passes to itself over and over again.
I may be wrong, but there simply must be a better way of doing the same thing...
Note that the original question asked for all the factors, not for only the prime factors. There being many fewer prime factors, they can probably be found more quickly. Perhaps that's what the OQ wanted. Perhaps not. But let's solve the original problem and put the "fun" back in "functional"!
Some observations:
The two functions don't produce the same output: if x is a perfect square, the second function includes the square root twice.
The first function checks a number of potential factors proportional to the size of x; the second function checks only a number proportional to the square root of x, then stops (with the bug noted above).
The first function (factors) allocates a list of all integers from 2 to n div 2, where the second function never allocates a list but instead visits fewer integers one at a time in a parameter. I ran the optimizer with -O and looked at the output with -ddump-simpl, and GHC just isn't smart enough to optimize away those allocations.
factorcalc is tail-recursive, which means it compiles into a tight machine-code loop; filter is not and does not.
Some experiments show that the square root is the killer:
Here's a sample function that produces the factors of x from z down to 2:
factors_from x 1 = []
factors_from x z
  | x `mod` z == 0 = z : factors_from x (z-1)
  | otherwise = factors_from x (z-1)

factors'' x = factors_from x (x `div` 2)
It's a bit faster because it doesn't allocate, but it's still not tail-recursive.
Here's a tail-recursive version that is more faithful to the original:
factors_from' x 1 l = l
factors_from' x z l
  | x `mod` z == 0 = factors_from' x (z-1) (z:l)
  | otherwise = factors_from' x (z-1) l

factors''' x = factors_from' x (x `div` 2) []
This is still slower than factorcalc because it enumerates all the integers from 2 to x div 2, whereas factorcalc stops at the square root.
Armed with this knowledge, we can now create a more functional version of factorcalc which replicates both its speed and its bug:
factors'''' x = sort $ uncurry (++) $ unzip $ takeWhile (uncurry (<=)) $
[ (z, x `div` z) | z <- [2..x], x `mod` z == 0 ]
I didn't time it exactly, but given 100 million as an input, both it and factorcalc terminate instantaneously, where the others all take a number of seconds.
How and why the function works is left as an exercise for the reader :-)
ADDENDUM: OK, to mitigate the eyeball bleeding, here's a slightly saner version (and without the bug):
saneFactors x = sort $ concat $ takeWhile small $
                [ pair z | z <- [2..], x `mod` z == 0 ]
  where pair z = if z * z == x then [z] else [z, x `div` z]
        small [z, z'] = z < z'
        small [z] = True
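A quick check of my own in GHCi:

ghci> saneFactors 100
[2,4,5,10,20,25,50]

with the square root appearing only once.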
Okay, take a deep breath. It'll be all right.
First of all, why is your first attempt slow? How is it spending its time?
Can you think of a recursive definition for the prime factorization that doesn't have that property?
(Hint.)
Firstly, although factorcalc is "ugly", you could add a wrapper function factors' x = factorcalc x 2 [], add a comment, and move on.
If you want to make a 'beautiful' factors fast, you need to find out why it is slow. Looking at your two functions, factors walks the list about n/2 elements long, but factorcalc stops after around sqrt n iterations.
Here is another factors that also stops after about sqrt n iterations, but uses a fold instead of explicit iteration. It also breaks the problem into three parts: finding the factors (factor); stopping at the square root of x (small) and then computing pairs of factors (factorize):
factors' :: (Integral a) => a -> [a]
factors' x = sort (foldl factorize [] (takeWhile small (filter factor [2..])))
  where
    factor z = x `mod` z == 0
    small z = z <= (x `div` z)
    factorize acc z = z : (if z == y then acc else y : acc)
      where y = x `div` z
This is marginally faster than factorcalc on my machine. You can fuse factor and factorize, and it is about twice as fast as factorcalc.
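For instance, the fusion might look like this (my sketch of that suggestion, under a fresh name to avoid clashing with the definitions above; sort comes from Data.List):

factorsFused :: (Integral a) => a -> [a]
factorsFused x = sort (foldl step [] (takeWhile (\z -> z <= x `div` z) [2..]))
  where
    -- One pass per candidate: test divisibility and record both halves of the pair.
    step acc z
      | x `mod` z == 0 = let y = x `div` z
                         in z : (if z == y then acc else y : acc)
      | otherwise = acc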
The Profiling and Optimization chapter of Real World Haskell is a good guide to the GHC suite's performance tools for tackling tougher performance problems.
By the way, I have a minor style nitpick with factorcalc: it is much more efficient to prepend single elements to the front of a list (O(1)) than to append to the end of a list of length n (O(n)). The lists of factors are typically small, so it is not such a big deal, but factorcalc should probably be something like:
factorcalc :: (Integral a) => a -> a -> [a] -> [a]
factorcalc x y z
  | y `elem` z = sort z
  | x `mod` y == 0 = factorcalc x (y+1) (y : (x `div` y) : z)
  | otherwise = factorcalc x (y+1) z
"Since you can't change state in purely functional programming, I cheated by holding the data in the parameters, which the function simply passes to itself over and over again."
Actually, this is not cheating; this is a—no, make that the—standard technique! That sort of parameter is usually known as an "accumulator," and it's generally hidden within a helper function that does the actual recursion after being set up by the function you're calling.
A common case is when you're doing list operations that depend on the previous data in the list. The two problems you need to solve are, where do you get the data about previous iterations, and how do you deal with the fact that your "working area of interest" for any particular iteration is actually at the tail of the result list you're building. For both of these, the accumulator comes to the rescue. For example, to generate a list where each element is the sum of all of the elements of the input list up to that point:
sums :: Num a => [a] -> [a]
sums inp = helper inp []
  where
    helper [] acc = reverse acc
    helper (x:xs) [] = helper xs [x]
    helper (x:xs) acc@(h:_) = helper xs (x+h : acc)
Note that we flip the direction of the accumulator, so we can operate on the head of that, which is much more efficient (as Dominic mentions), and then we just reverse the final output.
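A quick check of the helper in GHCi:

ghci> sums [1, 2, 3, 4]
[1,3,6,10]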
By the way, I found reading The Little Schemer to be a useful introduction and offer good practice in thinking recursively.
This seemed like an interesting problem, and I hadn't coded any real Haskell in a while, so I gave it a crack. I've run both it and Norman's factors'''' against the same values, and it feels like mine's faster, though they're both so close that it's hard to tell.
factors :: Int -> [Int]
factors n = firstFactors ++ reverse [ n `div` i | i <- firstFactors ]
  where
    firstFactors = filter (\i -> n `mod` i == 0) (takeWhile (\i -> i * i <= n) [2..n])
Factors can be paired up into those that are greater than sqrt n and those that are less than or equal to it (for simplicity's sake, the exact square root, if n is a perfect square, falls into this category). So if we just take the ones that are less than or equal, we can calculate the others later by doing n `div` i. They'll be in reverse order, so we can either reverse firstFactors first or reverse the result later. It doesn't really matter.
This is my "functional" approach to the problem. ("Functional" in quotes, because I'd approach this problem the same way even in non-functional languages, but maybe that's because I've been tainted by Haskell.)
{-# LANGUAGE PatternGuards #-}

factors :: (Integral a) => a -> [a]
factors = multiplyFactors . primeFactors primes 0 [] . abs where
    multiplyFactors [] = [1]
    multiplyFactors ((p, n) : factors) =
        [ pn * x
        | pn <- take (succ n) $ iterate (* p) 1
        , x <- multiplyFactors factors ]
    primeFactors _ _ _ 0 = error "Can't factor 0"
    primeFactors (p:primes) n list x
        | (x', 0) <- x `divMod` p
        = primeFactors (p:primes) (succ n) list x'
    primeFactors _ 0 list 1 = list
    primeFactors (_:primes) 0 list x = primeFactors primes 0 list x
    primeFactors (p:primes) n list x
        = primeFactors primes 0 ((p, n) : list) x
    primes = sieve [2..]
    sieve (p:xs) = p : sieve [x | x <- xs, x `mod` p /= 0]
primes is the naive Sieve of Eratosthenes. There are better ones, but this is the shortest method.
sieve [2..]
=> 2 : sieve [x | x <- [3..], x `mod` 2 /= 0]
=> 2 : 3 : sieve [x | x <- [4..], x `mod` 2 /= 0, x `mod` 3 /= 0]
=> 2 : 3 : sieve [x | x <- [5..], x `mod` 2 /= 0, x `mod` 3 /= 0]
=> 2 : 3 : 5 : ...
primeFactors is the simple repeated trial-division algorithm: it walks through the list of primes, and tries dividing the given number by each, recording the factors as it goes.
primeFactors (2:_) 0 [] 50
=> primeFactors (2:_) 1 [] 25
=> primeFactors (3:_) 0 [(2, 1)] 25
=> primeFactors (5:_) 0 [(2, 1)] 25
=> primeFactors (5:_) 1 [(2, 1)] 5
=> primeFactors (5:_) 2 [(2, 1)] 1
=> primeFactors _ 0 [(5, 2), (2, 1)] 1
=> [(5, 2), (2, 1)]
multiplyFactors takes a list of primes and powers, and explodes it back out to a full list of factors.

multiplyFactors [(5, 2), (2, 1)]
=> [ pn * x
   | pn <- take (succ 2) $ iterate (* 5) 1
   , x <- multiplyFactors [(2, 1)] ]
=> [ pn * x | pn <- [1, 5, 25], x <- [1, 2] ]
=> [1, 2, 5, 10, 25, 50]
factors just strings these two functions together, along with an abs to prevent infinite recursion in case the input is negative.
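Putting it together, a quick check of my own (the output happens to be ordered here, but it isn't sorted in general):

ghci> factors 50
[1,2,5,10,25,50]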
I don't know much about Haskell, but somehow I think this link is appropriate:
http://www.willamette.edu/~fruehr/haskell/evolution.html
Edit: I'm not entirely sure why people are so aggressive about the downvoting on this. The original poster's real problem was that the code was ugly; while it's funny, the point of the linked article is, to some extent, that advanced Haskell code is, in fact, ugly; the more you learn, the uglier your code gets, to some extent. The point of this answer was to point out to the OP that apparently, the ugliness of the code that he was lamenting is not uncommon.