Memoization done, what now? - performance

I was trying to solve a puzzle in Haskell and had written the following code:
u 0 p = 0.0
u 1 p = 1.0
u n p = 1.0 + minimum [((1.0-q)*(s k p)) + (u (n-k) p) | k <-[1..n], let q = (1.0-p)**(fromIntegral k)]
s 1 p = 0.0
s n p = 1.0 + minimum [((1.0-q)*(s (n-k) p)) + q*((s k p) + (u (n-k) p)) | k <-[1..(n-1)], let q = (1.0-(1.0-p)**(fromIntegral k))/(1.0-(1.0-p)**(fromIntegral n))]
This code was terribly slow though. I suspect the reason for this is that the same things get calculated again and again. I therefore made a memoized version:
memoUa = array (0,10000) ((0,0.0):(1,1.0):[(k,mua k) | k<- [2..10000]])
mua n = (1.0) + minimum [((1.0-q)*(memoSa ! k)) + (memoUa ! (n-k)) | k <-[1..n], let q = (1.0-0.02)**(fromIntegral k)]
memoSa = array (0,10000) ((0,0.0):(1,0.0):[(k,msa k) | k<- [2..10000]])
msa n = (1.0) + minimum [((1.0-q) * (memoSa ! (n-k))) + q*((memoSa ! k) + (memoUa ! (n-k))) | k <-[1..(n-1)], let q = (1.0-(1.0-0.02)**(fromIntegral k))/(1.0-(1.0-0.02)**(fromIntegral n))]
This seems to be a lot faster, but now I get an out of memory error. I do not understand why this happens (the same strategy in java, without recursion, has no problems). Could somebody point me in the right direction on how to improve this code?
EDIT: I am adding my java version here (as I don't know where else to put it). I realize that the code isn't really reader-friendly (no meaningful names, etc.), but I hope it is clear enough.
public class Main {
public static double calc(double p) {
double[] u = new double[10001];
double[] s = new double[10001];
u[0] = 0.0;
u[1] = 1.0;
s[0] = 0.0;
s[1] = 0.0;
for (int n=2;n<10001;n++) {
double q = 1.0;
double denom = 1.0;
for (int k = 1; k <= n; k++ ) {
denom = denom * (1.0 - p);
}
denom = 1.0 - denom;
s[n] = (double) n;
u[n] = (double) n;
for (int k = 1; k <= n; k++ ) {
q = (1.0 - p) * q;
if (k<n) {
double qs = (1.0-q)/denom;
double bs = (1.0-qs)*s[n-k] + qs*(s[k]+ u[n-k]) + 1.0;
if (bs < s[n]) {
s[n] = bs;
}
}
double bu = (1.0-q)*s[k] + 1.0 + u[n-k];
if (bu < u[n]) {
u[n] = bu;
}
}
}
return u[10000];
}
public static void main(String[] args) {
double s = 0.0;
int i = 2;
//for (int i = 1; i<51; i++) {
s = s + calc(i*0.01);
//}
System.out.println("result = " + s);
}
}

I don't run out of memory when I run the compiled version, but there is a significant difference between how the Java version works and how the Haskell version works which I'll illustrate here.
The first thing to do is to add some important type signatures. In particular, you don't want Integer array indices, so I added:
memoUa :: Array Int Double
memoSa :: Array Int Double
I found these using ghc-mod check. I also added a main so that you can run it from the command line:
import System.Environment
main = do
(arg:_) <- getArgs
let n = read arg
print $ mua n
Now to gain some insight into what's going on, we can compile the program using profiling:
ghc -O2 -prof memo.hs
Then when we invoke the program like this:
memo 1000 +RTS -s
we will get profiling output which looks like:
164.31333233347755
98,286,872 bytes allocated in the heap
29,455,360 bytes copied during GC
657,080 bytes maximum residency (29 sample(s))
38,260 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 0.03s 0.03s 0.0002s 0.0011s
Gen 1 29 colls, 0 par 0.03s 0.03s 0.0011s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.21s ( 0.21s elapsed)
GC time 0.06s ( 0.06s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.27s elapsed)
%GC time 21.8% (22.3% elapsed)
Alloc rate 468,514,624 bytes per MUT second
Productivity 78.2% of total user, 77.3% of total elapsed
Important things to pay attention to are:
maximum residency
Total time
%GC time (or Productivity)
Maximum residency is a measure of how much memory is needed by the program. %GC time the proportion of the time spent in garbage collection and Productivity is the complement (100% - %GC time).
If you run the program for various input values you will see a productivity of around 80%:
n Max Res. Prod. Time Output
2000 779,076 79.4% 1.10s 328.54535361588535
4000 1,023,016 80.7% 4.41s 657.0894961398351
6000 1,299,880 81.3% 9.91s 985.6071032981068
8000 1,539,352 81.5% 17.64s 1314.0968411684714
10000 1,815,600 81.7% 27.57s 1642.5891214360522
This means that about 20% of the run time is spent in garbage collection. Also, we see increasing memory usage as n increases.
It turns out we can dramatically improve productivity and memory usage by telling Haskell the order in which to evaluate the array elements instead of relying on lazy evaluation:
import Control.Monad (forM_)
main = do
(arg:_) <- getArgs
let n = read arg
forM_ [1..n] $ \i -> mua i `seq` return ()
print $ mua n
And the new profiling stats are:
n Max Res. Prod. Time Output
2000 482,800 99.3% 1.31s 328.54535361588535
4000 482,800 99.6% 5.88s 657.0894961398351
6000 482,800 99.5% 12.09s 985.6071032981068
8000 482,800 98.1% 21.71s 1314.0968411684714
10000 482,800 96.1% 34.58s 1642.5891214360522
Some interesting observations here: productivity is up, memory usage is down (constant now over the range of inputs) but run time is up. This suggests that we forced more computations than we needed to. In an imperative language like Java you have to give an evaluation order so you would know exactly which computations need to be performed. It would interesting to see your Java code to see which computations it is performing.

Related

How to optimize a function and minimize allocations

The following function generates primes up to N. For large N, this becomes quite slow, my Julia implementation is 5X faster for N = 10**7. I guess the creation of a large integer array and using pack to collect the result is the slowest part. I tried counting .true.s first, then allocating res(:) and populating it using a loop, but the speedup was negligible (4%) as I iterate the prims array twice in this case. In Julia, I used findall which does exactly what I did; iterating the array twice, first counting trues and allocationg result then populating it. Any ideas? Thank you.
Compiler:
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.3.210 Build 20180410 (on Windows 10)
Options: ifort -warn /O3 -heap-arrays:8000000
program main
implicit none
integer, allocatable :: primes(:)
integer :: t0, t1, count_rate, count_max
call system_clock(t0, count_rate, count_max)
primes = do_primes(10**7)
call system_clock(t1)
print '(a,f7.5,a)', 'Elapsed time: ', real(t1-t0)/count_rate, ' seconds'
print *, primes(1:10)
contains
function do_primes(N) result (res)
integer, allocatable :: res(:), array(:)
logical, allocatable :: prims(:)
integer :: N, i, j
allocate (prims(N))
prims = .true.
i = 3
do while (i * i < N)
j = i
do while (j * i < N)
prims(j*i) = .false.
j = j + 2
end do
i = i + 2
end do
prims(1) = .false.
prims(2) = .true.
do i = 4, N, 2
prims(i) = .false.
end do
allocate (array(N))
do i = 1, N
array(i) = i
end do
res = pack(array, prims)
end
end
Timing (147 runs):
Elapsed time: 0.14723 seconds
Edit:
I converted the do whiles to straight dos as per #IanBush comment like this, still no speedup:
do i = 3, sqrt(dble(N)), 2
do j = i, N/i, 2
prims(j*i) = .false.
end do
end do
The Julia implementation:
function do_primes(N)
prims = trues(N)
i = 3
while i * i < N
j = i
while j * i < N
prims[j*i] = false
j = j + 2
end
i = i + 2
end
prims[1] = false
prims[2] = true
prims[4:2:N] .= false
return findall(prims)
end
Timing:
using Benchmarktools
#benchmark do_primes(10^7)
BenchmarkTools.Trial:
memory estimate: 6.26 MiB
allocs estimate: 5
--------------
minimum time: 32.227 ms (0.00% GC)
median time: 32.793 ms (0.00% GC)
mean time: 34.098 ms (3.92% GC)
maximum time: 94.479 ms (65.46% GC)
--------------
samples: 147
evals/sample: 1

Exponent calculation speed

I am currently testing Julia (I've worked with Matlab)
In matlab the calculation speed of N^3 is slower than NxNxN. This doesn't happen with N^2 and NxN. They use a different algorithm to calculate higher-order exponents because they prefer accuracy rather than speed.
I think Julia do the same thing.
I wanted to ask if there is a way to force Julia to calculate the exponent of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago a I did a few test on matlab of this. I made a translation of that code to julia.
Links to code:
http://pastebin.com/bbeukhTc
(I cant upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x times of the smallest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x times of the smallests)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.46:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x times of the smallest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5 times of the smallest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests julia is faster than Matlab, but i am using a relative old version. I cant test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm
Checking llvm's source code:
define double #powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
%result = call double #llvm.powi.f64(double %F, i32 %power)
ret double %result
}
Now, the __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
const int recip = b < 0;
double r = 1;
while (1)
{
if (b & 1)
r *= a;
b /= 2;
if (b == 0)
break;
a *= a;
}
return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence multiplications. That's why N^3 is slower than N^2.
jl_powi_llvm (called in fastmath.jl. "jl_" is concatenated by macro expansion), on the other hand, casts the exponent to floating-point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
jl_value_t *ty = jl_typeof(a);
if (!jl_is_bitstype(ty))
jl_error("powi_llvm: a is not a bitstype");
if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
jl_error("powi_llvm: b is not a 32-bit bitstype");
jl_value_t *newv = newstruct((jl_datatype_t*)ty);
void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
int sz = jl_datatype_size(ty);
switch (sz) {
/* choose the right size c-type operation */
case 4:
*(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
break;
case 8:
*(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
break;
default:
jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
}
return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: Yes, there is a way to force usage of multiplication, at cost of accuracy. It's the #fastmath macro:
julia> #benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> #benchmark #fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with #fastmath, performance is much better.

Memory allocation in a fixed point algorithm

I need to find the fixed point of a function f. The algorithm is very simple:
Given X, compute f(X)
If ||X-f(X)|| is lower than a certain tolerance, exit and return X,
otherwise set X equal to f(X) and go back to 1.
I'd like to be sure I'm not allocating memory for a new object at every iteration
For now, the algorithm looks like this:
iter1 = function(x::Vector{Float64})
for iter in 1:max_it
oldx = copy(x)
g1(x)
delta = vnormdiff(x, oldx, 2)
if delta < tolerance
break
end
end
end
Here g1(x) is a function that sets x to f(x)
But it seems this loop allocates a new vector at every loop (see below).
Another way to write the algorithm is the following:
iter2 = function(x::Vector{Float64})
oldx = similar(x)
for iter in 1:max_it
(oldx, x) = (x, oldx)
g2(x, oldx)
delta = vnormdiff(oldx, x, 2)
if delta < tolerance
break
end
end
end
where g2(x1, x2) is a function that sets x1 to f(x2).
Is thi the most efficient and natural way to write this kind of iteration problem?
Edit1: timing shows that the second code is faster:
using NumericExtensions
max_it = 1000
tolerance = 1e-8
max_it = 100
g1 = function(x::Vector{Float64})
for i in 1:length(x)
x[i] = x[i]/2
end
end
g2 = function(newx::Vector{Float64}, x::Vector{Float64})
for i in 1:length(x)
newx[i] = x[i]/2
end
end
x = fill(1e7, int(1e7))
#time iter1(x)
# elapsed time: 4.688103075 seconds (4960117840 bytes allocated, 29.72% gc time)
x = fill(1e7, int(1e7))
#time iter2(x)
# elapsed time: 2.187916177 seconds (80199676 bytes allocated, 0.74% gc time)
Edit2: using copy!
iter3 = function(x::Vector{Float64})
oldx = similar(x)
for iter in 1:max_it
copy!(oldx, x)
g1(x)
delta = vnormdiff(x, oldx, 2)
if delta < tolerance
break
end
end
end
x = fill(1e7, int(1e7))
#time iter3(x)
# elapsed time: 2.745350176 seconds (80008088 bytes allocated, 1.11% gc time)
I think replacing the following lines in the first code
for iter = 1:max_it
oldx = copy( x )
...
by
oldx = zeros( N )
for iter = 1:max_it
oldx[:] = x # or copy!( oldx, x )
...
will be more efficient because no array is allocated. Also, the code can be made more efficient by writing for-loops explicitly. This can be seen, for example, from the following comparison
function test()
N = 1000000
a = zeros( N )
b = zeros( N )
#time c = copy( a )
#time b[:] = a
#time copy!( b, a )
#time \
for i = 1:length(a)
b[i] = a[i]
end
#time \
for i in eachindex(a)
b[i] = a[i]
end
end
test()
The result obtained with Julia0.4.0 on Linux(x86_64) is
elapsed time: 0.003955609 seconds (7 MB allocated)
elapsed time: 0.001279142 seconds (0 bytes allocated)
elapsed time: 0.000836167 seconds (0 bytes allocated)
elapsed time: 1.19e-7 seconds (0 bytes allocated)
elapsed time: 1.28e-7 seconds (0 bytes allocated)
It seems that copy!() is faster than using [:] in the left-hand side,
though the difference becomes marginal in repeated calculations (there seems to be
some overhead for the first [:] calculation). Btw, the last example using eachindex() is very convenient for looping over multi-dimensional arrays.
Similar comparison can be made for vnormdiff(), where use of norm( x - oldx ) etc is slower than an explicit loop for vector norm, because the former allocates one temporary array for x - oldx.

How do I optimize a loop which can be fully strict

I'm trying to write a brute-force solution to Project Euler Problem #145, and I cannot get my solution to run in less than about 1 minute 30 secs.
(I'm aware there are various short-cuts and even paper-and-pencil solutions; for the purpose of this question I'm not considering those).
In the best version I've come up with so far, profiling shows that the majority of the time is spent in foldDigits. This function need not be lazy at all, and to my mind ought to be optimized to a simple loop. As you can see I've attempted to make various bits of the program strict.
So my question is: without changing the overall algorithm, is there some way to bring the execution time of this program down to the sub-minute mark?
(Or if not, is there a way to see that the code of foldDigits is as optimized as possible?)
-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4
{-# LANGUAGE BangPatterns #-}
import Control.Parallel.Strategies
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
| n < 10 = i
| otherwise = foldDigits f i d
where (d, m) = n `quotRem` 10
!i = f acc m
reverseNumber :: Int -> Int
reverseNumber !n
= foldDigits accumulate 0 n
where accumulate !v !d = v * 10 + d
allDigitsOdd :: Int -> Bool
allDigitsOdd n
= foldDigits andOdd True n
where andOdd !a d = a && isOdd d
isOdd !x = x `rem` 2 /= 0
isReversible :: Int -> Bool
isReversible n
= notDivisibleByTen n && allDigitsOdd (n + rn)
where rn = reverseNumber n
notDivisibleByTen !x = x `rem` 10 /= 0
countRange acc start end
| start > end = acc
| otherwise = countRange (acc + v) (start + 1) end
where v = if isReversible start then 1 else 0
main
= print $ sum $ parMap rseq cr ranges
where max = 1000000000
qmax = max `div` 4
ranges = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
cr (s, e) = countRange 0 s e
As it stands, the core that ghc-7.6.1 produces for foldDigits (with -O2) is
Rec {
$wfoldDigits_r2cK
:: forall a_aha.
(a_aha -> GHC.Types.Int -> a_aha)
-> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
\ (# a_aha)
(w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
(w1_s285 :: a_aha)
(ww_s288 :: GHC.Prim.Int#) ->
case w1_s285 of acc_Xhi { __DEFAULT ->
let {
ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
[LclId, Str=DmdType]
ds_sNo =
case GHC.Prim.quotRemInt# ww_s288 10
of _ { (# ipv_aJA, ipv1_aJB #) ->
(GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
} } in
case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
of i_ahg { __DEFAULT ->
case GHC.Prim.<# ww_s288 10 of _ {
GHC.Types.False ->
case ds_sNo of _ { (d_Xsi, m_Xs5) ->
case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
$wfoldDigits_r2cK # a_aha w_s284 i_ahg ww1_X28L
}
};
GHC.Types.True -> i_ahg
}
}
}
end Rec }
which, as you can see, re-boxes the result of the quotRem call. The problem is that no property of f is available here, and as a recursive function, foldDigits cannot be inlined.
With a manual worker-wrapper transform making the function argument static,
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
where
go !acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (f acc r) q
foldDigits becomes inlinable, and you get specialised versions for your uses operating on unboxed data, but no top-level foldDigits, e.g.
Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
\ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
case ww1_s28J of ds_XJh {
__DEFAULT ->
case GHC.Prim.quotRemInt# ds_XJh 10
of _ { (# ipv_aJK, ipv1_aJL #) ->
$wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
};
0 -> ww_s28F
}
end Rec }
and the effect on computation time is tangible, for the original, I got
$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
196,407,088 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1827331 colls, 1827331 par 23.77s 11.86s 0.0000s 0.0041s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 54.94% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 620.52s (313.51s elapsed)
GC time 23.77s ( 11.86s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 644.29s (325.37s elapsed)
Alloc rate 2,923,834,808 bytes per MUT second
(I used -N2 since my i5 only has two physical cores), vs.
$ ./eul145 +RTS -s -N2
608720
16,000,063,624 bytes allocated in the heap
403,384 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 15852 colls, 15852 par 0.34s 0.17s 0.0000s 0.0037s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 43.86% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 314.85s (160.08s elapsed)
GC time 0.34s ( 0.17s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 315.20s (160.25s elapsed)
Alloc rate 50,817,657 bytes per MUT second
Productivity 99.9% of total user, 196.5% of total elapsed
with the modification. The running time roughly halved, and the allocations reduced 100-fold.

Is indexing of Data.Vector.Unboxed.Mutable.MVector really this slow?

I have an application that spends about 80% of its time computing the centroid of a large list (10^7) of high dimensional vectors (dim=100) using the Kahan summation algorithm. I have done my best at optimizing the summation, but it is still 20x slower than an equivalent C implementation. Profiling indicates that the culprits are the unsafeRead and unsafeWrite functions from Data.Vector.Unboxed.Mutable. My question is: are these functions really this slow or am I misunderstanding the profiling statistics?
Here are the two implementations. The Haskell one is compiled with ghc-7.0.3 using the llvm backend. The C one is compiled with llvm-gcc.
Kahan summation in Haskell:
{-# LANGUAGE BangPatterns #-}
module Test where
import Control.Monad ( mapM_ )
import Data.Vector.Unboxed ( Vector, Unbox )
import Data.Vector.Unboxed.Mutable ( MVector )
import qualified Data.Vector.Unboxed as U
import qualified Data.Vector.Unboxed.Mutable as UM
import Data.Word ( Word )
import Data.Bits ( shiftL, shiftR, xor )
prng :: Word -> Word
prng w = w' where
!w1 = w `xor` (w `shiftL` 13)
!w2 = w1 `xor` (w1 `shiftR` 7)
!w' = w2 `xor` (w2 `shiftL` 17)
mkVect :: Word -> Vector Double
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
foldV :: (Unbox a, Unbox b)
=> (a -> b -> a) -- componentwise function to fold
-> Vector a -- initial accumulator value
-> [Vector b] -- data vectors
-> Vector a -- final accumulator value
foldV fn accum vs = U.modify (\x -> mapM_ (liftV fn x) vs) accum where
liftV f acc = fV where
fV v = go 0 where
n = min (U.length v) (UM.length acc)
go i | i < n = step >> go (i + 1)
| otherwise = return ()
where
step = {-# SCC "fV_step" #-} do
a <- {-# SCC "fV_read" #-} UM.unsafeRead acc i
b <- {-# SCC "fV_index" #-} U.unsafeIndexM v i
{-# SCC "fV_write" #-} UM.unsafeWrite acc i $! {-# SCC "fV_apply" #-} f a b
kahan :: [Vector Double] -> Vector Double
kahan [] = U.singleton 0.0
kahan (v:vs) = fst . U.unzip $ foldV kahanStep acc vs where
acc = U.map (\z -> (z, 0.0)) v
kahanStep :: (Double, Double) -> Double -> (Double, Double)
kahanStep (s, c) x = (s', c') where
!y = x - c
!s' = s + y
!c' = (s' - s) - y
{-# NOINLINE kahanStep #-}
zero :: U.Vector Double
zero = U.replicate 100 0.0
myLoop n = kahan $ map mkVect [1..n]
main = print $ myLoop 100000
Compiling with ghc-7.0.3 using the llvm backend:
ghc -o Test_hs --make -fforce-recomp -O3 -fllvm -optlo-O3 -msse2 -main-is Test.main Test.hs
time ./Test_hs
real 0m1.948s
user 0m1.936s
sys 0m0.008s
Profiling information:
16,710,594,992 bytes allocated in the heap
33,047,064 bytes copied during GC
35,464 bytes maximum residency (1 sample(s))
23,888 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 31907 collections, 0 parallel, 0.28s, 0.27s elapsed
Generation 1: 1 collections, 0 parallel, 0.00s, 0.00s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 24.73s ( 24.74s elapsed)
GC time 0.28s ( 0.27s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 25.01s ( 25.02s elapsed)
%GC time 1.1% (1.1% elapsed)
Alloc rate 675,607,179 bytes per MUT second
Productivity 98.9% of total user, 98.9% of total elapsed
Thu Feb 23 02:42 2012 Time and Allocation Profiling Report (Final)
Test_hs +RTS -s -p -RTS
total time = 24.60 secs (1230 ticks # 20 ms)
total alloc = 8,608,188,392 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
fV_write Test 31.1 26.0
fV_read Test 27.2 23.2
mkVect Test 12.3 27.2
fV_step Test 11.7 0.0
foldV Test 5.9 5.7
fV_index Test 5.2 9.3
kahanStep Test 3.3 6.5
prng Test 2.2 1.8
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 1 0 0.0 0.0 100.0 100.0
CAF:main1 Test 339 1 0.0 0.0 0.0 0.0
main Test 346 1 0.0 0.0 0.0 0.0
CAF:main2 Test 338 1 0.0 0.0 100.0 100.0
main Test 347 0 0.0 0.0 100.0 100.0
myLoop Test 348 1 0.2 0.2 100.0 100.0
mkVect Test 350 400000 12.3 27.2 14.5 29.0
prng Test 351 9900000 2.2 1.8 2.2 1.8
kahan Test 349 102 0.0 0.0 85.4 70.7
foldV Test 359 1 5.9 5.7 85.4 70.7
fV_step Test 360 9999900 11.7 0.0 79.5 65.1
fV_write Test 367 19999800 31.1 26.0 35.4 32.5
fV_apply Test 368 9999900 1.0 0.0 4.3 6.5
kahanStep Test 369 9999900 3.3 6.5 3.3 6.5
fV_index Test 366 9999900 5.2 9.3 5.2 9.3
fV_read Test 361 9999900 27.2 23.2 27.2 23.2
CAF:lvl19_r3ei Test 337 1 0.0 0.0 0.0 0.0
kahan Test 358 0 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad3_r3eg Test 336 1 0.0 0.0 0.0 0.0
kahan Test 357 0 0.0 0.0 0.0 0.0
CAF:$dMVector2_r3ee Test 335 1 0.0 0.0 0.0 0.0
CAF:$dVector1_r3ec Test 334 1 0.0 0.0 0.0 0.0
CAF:poly_$dMonad_r3ea Test 333 1 0.0 0.0 0.0 0.0
CAF:$dMVector1_r3e2 Test 330 1 0.0 0.0 0.0 0.0
CAF:poly_$dPrimMonad2_r3e0 Test 328 1 0.0 0.0 0.0 0.0
foldV Test 365 0 0.0 0.0 0.0 0.0
CAF:lvl11_r3dM Test 322 1 0.0 0.0 0.0 0.0
kahan Test 354 0 0.0 0.0 0.0 0.0
CAF:lvl10_r3dK Test 321 1 0.0 0.0 0.0 0.0
kahan Test 355 0 0.0 0.0 0.0 0.0
CAF:$dMVector_r3dI Test 320 1 0.0 0.0 0.0 0.0
kahan Test 356 0 0.0 0.0 0.0 0.0
CAF GHC.Float 297 1 0.0 0.0 0.0 0.0
CAF GHC.IO.Handle.FD 256 2 0.0 0.0 0.0 0.0
CAF GHC.IO.Encoding.Iconv 214 2 0.0 0.0 0.0 0.0
CAF GHC.Conc.Signal 211 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Generic 182 1 0.0 0.0 0.0 0.0
CAF Data.Vector.Unboxed 174 2 0.0 0.0 0.0 0.0
The equivalent implementation in C:
#include <stdint.h>
#include <stdio.h>
#define VDIM 100
#define VNUM 100000
uint64_t prng (uint64_t w) {
w ^= w << 13;
w ^= w >> 7;
w ^= w << 17;
return w;
};
void kahanStep (double *s, double *c, double x) {
double y, t;
y = x - *c;
t = *s + y;
*c = (t - *s) - y;
*s = t;
}
void kahan(double s[], double c[]) {
for (int i = 1; i <= VNUM; i++) {
uint64_t w = i;
for (int j = 0; j < VDIM; j++) {
kahanStep(&s[j], &c[j], w);
w = prng(w);
}
}
};
int main (int argc, char* argv[]) {
double acc[VDIM], err[VDIM];
for (int i = 0; i < VDIM; i++) {
acc[i] = err[i] = 0.0;
};
kahan(acc, err);
printf("[ ");
for (int i = 0; i < VDIM; i++) {
printf("%g ", acc[i]);
};
printf("]\n");
};
Compiled with llvm-gcc:
>llvm-gcc -o Test_c -O3 -msse2 -std=c99 test.c
>time ./Test_c
real 0m0.096s
user 0m0.088s
sys 0m0.004s
Update 1: I un-inlined kahanStep in the C version. It barely made a dent in the performance. I hope that now we can all acknowledge Amdahl's law and move on. As
inefficient as kahanStep might be, unsafeRead and unsafeWrite are 9-10x slower. I was hoping someone could shed some light on the possible causes of that fact.
Also, I should say that since I am interacting with a library that uses Data.Vector.Unboxed, so I am kinda married to it at this point, and parting with it would be very traumatic :-)
Update 2: I guess I was not clear enough in my original question. I am not looking for ways to speed up this microbenchmark. I am looking for an explanation of the counter intuitive profiling stats, so I can decide whether or not to file a bug report against vector.
Your C version is not equivalent to your Haskell implementation. In C you've inlined the important Kahan summation step yourself, in Haskell you created a polymorphic higher order function that does a lot more and takes the transformation step as a parameter. Moving kahanStep to a separate function in C isn't the point, it will still be inlined by the compiler. Even if you put it into its own source file, compile separately and link without link-time optimisation, you have only addressed part of the difference.
I have made a C version that is closer to the Haskell version,
kahan.h:
typedef struct DPair_st {
double fst, snd;
} DPair;
DPair kahanStep(DPair pr, double x);
kahanStep.c:
#include "kahan.h"
DPair kahanStep (DPair pr, double x) {
double y, t;
y = x - pr.snd;
t = pr.fst + y;
pr.snd = (t - pr.fst) - y;
pr.fst = t;
return pr;
}
main.c:
#include <stdint.h>
#include <stdio.h>
#include "kahan.h"
#define VDIM 100
#define VNUM 100000
uint64_t prng (uint64_t w) {
w ^= w << 13;
w ^= w >> 7;
w ^= w << 17;
return w;
};
void kahan(double s[], double c[], DPair (*fun)(DPair,double)) {
for (int i = 1; i <= VNUM; i++) {
uint64_t w = i;
for (int j = 0; j < VDIM; j++) {
DPair pr;
pr.fst = s[j];
pr.snd = c[j];
pr = fun(pr,w);
s[j] = pr.fst;
c[j] = pr.snd;
w = prng(w);
}
}
};
int main (int argc, char* argv[]) {
double acc[VDIM], err[VDIM];
for (int i = 0; i < VDIM; i++) {
acc[i] = err[i] = 0.0;
};
kahan(acc, err,kahanStep);
printf("[ ");
for (int i = 0; i < VDIM; i++) {
printf("%g ", acc[i]);
};
printf("]\n");
};
Compiled separately and linked, that runs about 25% slower than the first C version here (0.1s vs. 0.079s).
Now you have a higher order function in C, considerably slower than the original, but still much faster than the Haskell code. One important difference is that the C function takes an unboxed pair of doubles and an unboxed double as arguments, while the Haskell kahanStep takes a boxed pair of boxed Doubles and a boxed Double and returns a boxed pair of boxed Doubles, requiring expensive boxing and unboxing in the foldV loop. That is addressable by more inlining. Explicitly inlining foldV, kahanStep, and step brings the time down from 0.90s to 0.74s here with ghc-7.0.4 (it has a smaller effect on ghc-7.4.1's output, from 0.99s down to 0.90s).
But the boxing and unboxing is, alas, the smaller part of the difference. foldV does much more than C's kahan, it takes a list of vectors used to modify the accumulator. That list of vectors is completely absent in the C code, and that makes a big difference. All these 100000 vectors have to be allocated, filled and put into a list (due to laziness, not all of them are simultaneously alive, so there's no space problem, but they, as well as the list cells, have to be allocated and garbage collected, which takes considerable time). And in the loop proper, instead of having a Word# passed in a register, the precomputed value is read from the vector.
If you use a more direct translation of the C to Haskell,
{-# LANGUAGE CPP, BangPatterns #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
prng :: Word -> Word
prng w = w'
where
!w1 = w `xor` (w `shiftL` 13)
!w2 = w1 `xor` (w1 `shiftR` 7)
!w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner w j
| j < VDIM = do
!cj <- unsafeRead c j
!sj <- unsafeRead s j
let !y = fromIntegral w - cj
!t = sj + y
!w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
main :: IO ()
main = print . elems $ runSTUArray calc
it's much faster. Admittedly it's still about three times slower than the C, but the original was 13 times slower here (and I don't have llvm installed, so I use vanilla gcc and the native backed of GHC, using llvm may give slightly different results).
I don't think that indexing is really the culprit. The vector package heavily relies on compiler magic, but compiling for profiling support massively interferes with that. For packages like vector or bytestring which use their own fusion framework for optimisation, the profiling interference can be rather disastrous and the profiling results utterly useless. I'm inclined to believe we have such a case here.
In the Core, all reads and writes are transformed to the primops readDoubleArray#, indexDoubleArray# and writeDoubleArray#, which are fast. Maybe a bit slower than a C array access, but not very much. So I'm confident that that's not the problem and the cause of the big difference. But you have put {-# SCC #-} annotations on them, so disabling any optimisation involving a rearrangement of any of those terms. And each time one of these points is entered, it has to be recorded. I'm not familiar enough with the profiler and optimiser to know what exactly happens, but, as a data point, with the {-# INLINE #-} pragmas on foldV, step and kahanStep, a profiling run with these SCCs took 3.17s, and with the SCCs fV_step, fV_read, fV_index, fV_write and fV_apply removed (nothing else changed) a profiling run took only 2.03s (both times as reported by +RTS -P, so with the profiling overhead subtracted). That difference shows that SCCs on cheap functions and too fine-grained SCCs can massively skew the profiling results. Now if we also put {-# INLINE #-} pragmas on mkVect, kahan and prng, we are left with a completely uninformative profile, but the run takes only 1.23s. (These last inlinings have, however, no effect for the non-profiling runs, without profiling, they are inlined automatically.)
So, don't take the profiling results as unquestionable truths. The more your code (directly or indirectly through the libraries used) depends on optimisations, the more it is vulnerable to misleading profiling results caused by disabled optimisations. This also holds, but to a much lesser extent, for heap-profiling to pin down space leaks.
When you have a suspicious profiling result, check what happens when you remove some SCCs. If that results in a big drop of run time, that SCC was not your primary problem (it may become a problem again after other problems have been fixed).
Looking at the Core generated for your programme, what jumped out was that your kahanStep - by the way, remove the {-# NOINLINE #-} pragma from that, it's counter-productive - produced a boxed pair of boxed Doubles in the loop, that was immediately deconstructed and the components unboxed. Such unnecessary intermediate boxing of values is expensive and slows computations down a lot.
As this came up on haskell-cafe again today where somebody got terrible performance from the above code with ghc-7.4.1, tibbe took it upon himself to investigate the core that GHC produced and found that GHC produced suboptimal code for the conversion from Word to Double. Replacing the fromIntegral of the conversion with a custom conversion using only (wrapped) primitives (and removing the bang patterns that don't make a difference here, GHC's strictness analyser is good enough to see through the algorithm, I should learn to trust it more ;), we obtain a version that is on par with gcc -O3's output for the original C:
{-# LANGUAGE CPP #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import GHC.Word
import Control.Monad
import Data.Bits
import GHC.Float (int2Double)
prng :: Word -> Word
prng w = w'
where
w1 = w `xor` (w `shiftL` 13)
w2 = w1 `xor` (w1 `shiftR` 7)
w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner w j
| j < VDIM = do
cj <- unsafeRead c j
sj <- unsafeRead s j
let y = word2Double w - cj
t = sj + y
w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
forM_ [1 .. VNUM] $ \i -> inner (fromIntegral i) 0
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
correction :: Double
correction = 2 * int2Double minBound
word2Double :: Word -> Double
word2Double w = case fromIntegral w of
i | i < 0 -> int2Double i - correction
| otherwise -> int2Double i
main :: IO ()
main = print . elems $ runSTUArray calc
There is a funny mixing in of list combinators in all of this seemingly Data.Vector code. If I make the very first obvious emendation, replacing
mkVect = U.force . U.map fromIntegral . U.fromList . take 100 . iterate prng
with the correct use of Data.Vector.Unboxed:
mkVect = U.force . U.map fromIntegral . U.iterateN 100 prng
then my time falls by two thirds -- from real 0m1.306s to real 0m0.429s It looks like all of the top level functions have this problem except prng and zero
This came up on the mailing lists and I discovered that there's a bug in the Word->Double conversion code in GHC 7.4.1 (at least). This version, which works around the bug, is as fast as the C code on my machine:
{-# LANGUAGE CPP, BangPatterns, MagicHash #-}
module Main (main) where
#define VDIM 100
#define VNUM 100000
import Control.Monad.ST
import Data.Array.Base
import Data.Array.ST
import Data.Bits
import GHC.Word
import GHC.Exts
prng :: Word -> Word
prng w = w'
where
w1 = w `xor` (w `shiftL` 13)
w2 = w1 `xor` (w1 `shiftR` 7)
w' = w2 `xor` (w2 `shiftL` 17)
type Vec s = STUArray s Int Double
kahan :: Vec s -> Vec s -> ST s ()
kahan s c = do
let inner !w j
| j < VDIM = do
cj <- unsafeRead c j
sj <- unsafeRead s j
let y = word2Double w - cj
t = sj + y
w' = prng w
unsafeWrite c j ((t-sj)-y)
unsafeWrite s j t
inner w' (j+1)
| otherwise = return ()
outer i | i <= VNUM = inner (fromIntegral i) 0 >> outer (i + 1)
| otherwise = return ()
outer (1 :: Int)
calc :: ST s (Vec s)
calc = do
s <- newArray (0,VDIM-1) 0
c <- newArray (0,VDIM-1) 0
kahan s c
return s
main :: IO ()
main = print . elems $ runSTUArray calc
{- I originally used this function, which isn't quite correct.
We need a real bug fix in GHC.
word2Double :: Word -> Double
word2Double (W# w) = D# (int2Double# (word2Int# w))
-}
correction :: Double
correction = 2 * int2Double minBound
word2Double :: Word -> Double
word2Double w = case fromIntegral w of
i | i < 0 -> int2Double i - correction
| otherwise -> int2Double i
Other than working around the Word->Double bug, I've also removed extra lists to match the C version better.
I know you didn't ask for a way to improve this micro-benchmark, but I'll give you an explanation that might prove helpful when writing loops in the future:
An unknown function call, such as the one made to the higher-order argument of foldV, can be expensive when done frequently in a loop. In particular, it will inhibit unboxing of the function arguments, leading to increased allocation. The reason it inhibits argument unboxing is that we don't know that the function we're calling is strict in those arguments and thus we pass the arguments as e.g. (Double, Double), instead of as Double# -> Double#.
The compiler can figure out the strictness information if the loop (e.g. foldV) meets the loop body (e.g. kahanStep). For that reason I recommend that people INLINE higher-order functions. In this case, inlining foldV and removing the NOINLINE on kahanStep improves the runtime quite a bit for me.
This doesn't bring the performance on par with C in this case, as there are other things going on (as others have commented on), but it's a step in the right direction (and it's a step you can do without every having to look at profiling output).

Resources