Scala recursion vs loop: performance and runtime considerations - performance

I've wrote a naïve test-bed to measure the performance of three kinds of factorial implementation: loop based, non tail-recursive and tail-recursive.
Surprisingly to me the worst performant was the loop ones («while» was expected to be more efficient so I provided both) that cost
almost twice than the tail recursive alternative.
*ANSWER: fixing the loop implementation avoiding the = operator which outperform worst with BigInt due to its internals «loops» became fastest as expected
Another «woodoo» behavior I've experienced was the StackOverflow
exception which wasn't thrown systematically for the same input in the
case of non-tail recursive implementation. I can circumvent the
StackOverlow by progressively call the function with larger and larger
values… I feel crazy :) Answer: JVM require to converge during startup, then behavior is coherent and systematic
This is the code:
final object Factorial {
type Out = BigInt
def calculateByRecursion(n: Int): Out = {
require(n>0, "n must be positive")
n match {
case _ if n == 1 => return 1
case _ => return n * calculateByRecursion(n-1)
}
}
def calculateByForLoop(n: Int): Out = {
require(n>0, "n must be positive")
var accumulator: Out = 1
for (i <- 1 to n)
accumulator = i * accumulator
accumulator
}
def calculateByWhileLoop(n: Int): Out = {
require(n>0, "n must be positive")
var accumulator: Out = 1
var i = 1
while (i <= n) {
accumulator = i * accumulator
i += 1
}
accumulator
}
def calculateByTailRecursion(n: Int): Out = {
require(n>0, "n must be positive")
#tailrec def fac(n: Int, acc: Out): Out = n match {
case _ if n == 1 => acc
case _ => fac(n-1, n * acc)
}
fac(n, 1)
}
def calculateByTailRecursionUpward(n: Int): Out = {
require(n>0, "n must be positive")
#tailrec def fac(i: Int, acc: Out): Out = n match {
case _ if i == n => n * acc
case _ => fac(i+1, i * acc)
}
fac(1, 1)
}
def comparePerformance(n: Int) {
def showOutput[A](msg: String, data: (Long, A), showOutput:Boolean = false) =
showOutput match {
case true => printf("%s returned %s in %d ms\n", msg, data._2.toString, data._1)
case false => printf("%s in %d ms\n", msg, data._1)
}
def measure[A](f:()=>A): (Long, A) = {
val start = System.currentTimeMillis
val o = f()
(System.currentTimeMillis - start, o)
}
showOutput ("By for loop", measure(()=>calculateByForLoop(n)))
showOutput ("By while loop", measure(()=>calculateByWhileLoop(n)))
showOutput ("By non-tail recursion", measure(()=>calculateByRecursion(n)))
showOutput ("By tail recursion", measure(()=>calculateByTailRecursion(n)))
showOutput ("By tail recursion upward", measure(()=>calculateByTailRecursionUpward(n)))
}
}
What follows is some output from sbt console (Before «while» implementation):
scala> example.Factorial.comparePerformance(10000)
By loop in 3 ns
By non-tail recursion in >>>>> StackOverflow!!!!!… see later!!!
........
scala> example.Factorial.comparePerformance(1000)
By loop in 3 ms
By non-tail recursion in 1 ms
By tail recursion in 4 ms
scala> example.Factorial.comparePerformance(5000)
By loop in 105 ms
By non-tail recursion in 27 ms
By tail recursion in 34 ms
scala> example.Factorial.comparePerformance(10000)
By loop in 236 ms
By non-tail recursion in 106 ms >>>> Now works!!!
By tail recursion in 127 ms
scala> example.Factorial.comparePerformance(20000)
By loop in 977 ms
By non-tail recursion in 495 ms
By tail recursion in 564 ms
scala> example.Factorial.comparePerformance(30000)
By loop in 2285 ms
By non-tail recursion in 1183 ms
By tail recursion in 1281 ms
What follows is some output from sbt console (After «while» implementation):
scala> example.Factorial.comparePerformance(10000)
By for loop in 252 ms
By while loop in 246 ms
By non-tail recursion in 130 ms
By tail recursion in 136 ns
scala> example.Factorial.comparePerformance(20000)
By for loop in 984 ms
By while loop in 1091 ms
By non-tail recursion in 508 ms
By tail recursion in 560 ms
What follows is some output from sbt console (after «upward» tail recursion implementation) the world come back sane:
scala> example.Factorial.comparePerformance(10000)
By for loop in 259 ms
By while loop in 229 ms
By non-tail recursion in 114 ms
By tail recursion in 119 ms
By tail recursion upward in 105 ms
scala> example.Factorial.comparePerformance(20000)
By for loop in 1053 ms
By while loop in 957 ms
By non-tail recursion in 513 ms
By tail recursion in 565 ms
By tail recursion upward in 470 ms
What follows is some output from sbt console after fixing BigInt multiplication in «loops»: the world is totally sane:
scala> example.Factorial.comparePerformance(20000)
By for loop in 498 ms
By while loop in 502 ms
By non-tail recursion in 521 ms
By tail recursion in 611 ms
By tail recursion upward in 503 ms
BigInt overhead and a stupid implementation by me masked the expected behavior.
PS.: In the end I should re-title this post to «A lernt lesson on BigInts»

For loops are not actually quite loops; they're for comprehensions on a range. If you actually want a loop, you need to use while. (Actually, I think the BigInt multiplication here is heavyweight enough so it shouldn't matter. But you'll notice if you're multiplying Ints.)
Also, you have confused yourself by using BigInt. The bigger your BigInt is, the slower your multiplication. So your non-tail loop counts up while your tail recursion loop counds down which means that the latter has more big numbers to multiply.
If you fix these two issues you will find that sanity is restored: loops and tail recursion are the same speed, with both regular recursion and for slower. (Regular recursion may not be slower if the JVM optimization makes it equivalent)
(Also, the stack overflow fix is probably because the JVM starts inlining and may either make the call tail-recursive itself, or unrolls the loop far enough so that you don't overflow any longer.)
Finally, you're getting poor results with for and while because you're multiplying on the right rather than the left with the small number. It turns out that the Java's BigInt multiplies faster with the smaller number on the left.

Scala static methods for factorial(n) (coded with scala 2.12.x, java-8):
object Factorial {
/*
* For large N, it throws a stack overflow
*/
def recursive(n:BigInt): BigInt = {
if(n < 0) {
throw new ArithmeticException
} else if(n <= 1) {
1
} else {
n * recursive(n - 1)
}
}
/*
* A tail recursive method is compiled to avoid stack overflow
*/
#scala.annotation.tailrec
def recursiveTail(n:BigInt, acc:BigInt = 1): BigInt = {
if(n < 0) {
throw new ArithmeticException
} else if(n <= 1) {
acc
} else {
recursiveTail(n - 1, n * acc)
}
}
/*
* A while loop
*/
def loop(n:BigInt): BigInt = {
if(n < 0) {
throw new ArithmeticException
} else if(n <= 1) {
1
} else {
var acc = 1
var idx = 1
while(idx <= n) {
acc = idx * acc
idx += 1
}
acc
}
}
}
Specs:
class FactorialSpecs extends SpecHelper {
private val smallInt = 10
private val largeInt = 10000
describe("Factorial.recursive") {
it("return 1 for 0") {
assert(Factorial.recursive(0) == 1)
}
it("return 1 for 1") {
assert(Factorial.recursive(1) == 1)
}
it("return 2 for 2") {
assert(Factorial.recursive(2) == 2)
}
it("returns a result, for small inputs") {
assert(Factorial.recursive(smallInt) == 3628800)
}
it("throws StackOverflow for large inputs") {
intercept[java.lang.StackOverflowError] {
Factorial.recursive(Int.MaxValue)
}
}
}
describe("Factorial.recursiveTail") {
it("return 1 for 0") {
assert(Factorial.recursiveTail(0) == 1)
}
it("return 1 for 1") {
assert(Factorial.recursiveTail(1) == 1)
}
it("return 2 for 2") {
assert(Factorial.recursiveTail(2) == 2)
}
it("returns a result, for small inputs") {
assert(Factorial.recursiveTail(smallInt) == 3628800)
}
it("returns a result, for large inputs") {
assert(Factorial.recursiveTail(largeInt).isInstanceOf[BigInt])
}
}
describe("Factorial.loop") {
it("return 1 for 0") {
assert(Factorial.loop(0) == 1)
}
it("return 1 for 1") {
assert(Factorial.loop(1) == 1)
}
it("return 2 for 2") {
assert(Factorial.loop(2) == 2)
}
it("returns a result, for small inputs") {
assert(Factorial.loop(smallInt) == 3628800)
}
it("returns a result, for large inputs") {
assert(Factorial.loop(largeInt).isInstanceOf[BigInt])
}
}
}
Benchmarks:
import org.scalameter.api._
class BenchmarkFactorials extends Bench.OfflineReport {
val gen: Gen[Int] = Gen.range("N")(1, 1000, 100) // scalastyle:ignore
performance of "Factorial" in {
measure method "loop" in {
using(gen) in {
n => Factorial.loop(n)
}
}
measure method "recursive" in {
using(gen) in {
n => Factorial.recursive(n)
}
}
measure method "recursiveTail" in {
using(gen) in {
n => Factorial.recursiveTail(n)
}
}
}
}
Benchmark results (loop is much faster):
[info] Test group: Factorial.loop
[info] - Factorial.loop.Test-9 measurements:
[info] - at N -> 1: passed
[info] (mean = 0.01 ms, ci = <0.00 ms, 0.02 ms>, significance = 1.0E-10)
[info] - at N -> 101: passed
[info] (mean = 0.01 ms, ci = <0.01 ms, 0.01 ms>, significance = 1.0E-10)
[info] - at N -> 201: passed
[info] (mean = 0.02 ms, ci = <0.02 ms, 0.02 ms>, significance = 1.0E-10)
[info] - at N -> 301: passed
[info] (mean = 0.03 ms, ci = <0.02 ms, 0.03 ms>, significance = 1.0E-10)
[info] - at N -> 401: passed
[info] (mean = 0.03 ms, ci = <0.03 ms, 0.04 ms>, significance = 1.0E-10)
[info] - at N -> 501: passed
[info] (mean = 0.04 ms, ci = <0.03 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 601: passed
[info] (mean = 0.04 ms, ci = <0.04 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 701: passed
[info] (mean = 0.05 ms, ci = <0.05 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 801: passed
[info] (mean = 0.06 ms, ci = <0.05 ms, 0.06 ms>, significance = 1.0E-10)
[info] - at N -> 901: passed
[info] (mean = 0.06 ms, ci = <0.05 ms, 0.07 ms>, significance = 1.0E-10)
[info] Test group: Factorial.recursive
[info] - Factorial.recursive.Test-10 measurements:
[info] - at N -> 1: passed
[info] (mean = 0.00 ms, ci = <0.00 ms, 0.01 ms>, significance = 1.0E-10)
[info] - at N -> 101: passed
[info] (mean = 0.05 ms, ci = <0.01 ms, 0.09 ms>, significance = 1.0E-10)
[info] - at N -> 201: passed
[info] (mean = 0.03 ms, ci = <0.02 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 301: passed
[info] (mean = 0.07 ms, ci = <0.00 ms, 0.13 ms>, significance = 1.0E-10)
[info] - at N -> 401: passed
[info] (mean = 0.09 ms, ci = <0.01 ms, 0.18 ms>, significance = 1.0E-10)
[info] - at N -> 501: passed
[info] (mean = 0.10 ms, ci = <0.03 ms, 0.17 ms>, significance = 1.0E-10)
[info] - at N -> 601: passed
[info] (mean = 0.11 ms, ci = <0.08 ms, 0.15 ms>, significance = 1.0E-10)
[info] - at N -> 701: passed
[info] (mean = 0.13 ms, ci = <0.11 ms, 0.14 ms>, significance = 1.0E-10)
[info] - at N -> 801: passed
[info] (mean = 0.16 ms, ci = <0.13 ms, 0.19 ms>, significance = 1.0E-10)
[info] - at N -> 901: passed
[info] (mean = 0.21 ms, ci = <0.15 ms, 0.27 ms>, significance = 1.0E-10)
[info] Test group: Factorial.recursiveTail
[info] - Factorial.recursiveTail.Test-11 measurements:
[info] - at N -> 1: passed
[info] (mean = 0.00 ms, ci = <0.00 ms, 0.01 ms>, significance = 1.0E-10)
[info] - at N -> 101: passed
[info] (mean = 0.04 ms, ci = <0.03 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 201: passed
[info] (mean = 0.12 ms, ci = <0.05 ms, 0.20 ms>, significance = 1.0E-10)
[info] - at N -> 301: passed
[info] (mean = 0.16 ms, ci = <-0.03 ms, 0.34 ms>, significance = 1.0E-10)
[info] - at N -> 401: passed
[info] (mean = 0.12 ms, ci = <0.09 ms, 0.16 ms>, significance = 1.0E-10)
[info] - at N -> 501: passed
[info] (mean = 0.17 ms, ci = <0.15 ms, 0.19 ms>, significance = 1.0E-10)
[info] - at N -> 601: passed
[info] (mean = 0.23 ms, ci = <0.19 ms, 0.26 ms>, significance = 1.0E-10)
[info] - at N -> 701: passed
[info] (mean = 0.25 ms, ci = <0.18 ms, 0.32 ms>, significance = 1.0E-10)
[info] - at N -> 801: passed
[info] (mean = 0.28 ms, ci = <0.21 ms, 0.36 ms>, significance = 1.0E-10)
[info] - at N -> 901: passed
[info] (mean = 0.32 ms, ci = <0.17 ms, 0.46 ms>, significance = 1.0E-10)

I know everyone already answered the question, but I thought that I might add this one optimization: If you convert the pattern matching to simple if-statements, it can speed up the tail recursion.
final object Factorial {
type Out = BigInt
def calculateByRecursion(n: Int): Out = {
require(n>0, "n must be positive")
n match {
case _ if n == 1 => return 1
case _ => return n * calculateByRecursion(n-1)
}
}
def calculateByForLoop(n: Int): Out = {
require(n>0, "n must be positive")
var accumulator: Out = 1
for (i <- 1 to n)
accumulator = i * accumulator
accumulator
}
def calculateByWhileLoop(n: Int): Out = {
require(n>0, "n must be positive")
var acc: Out = 1
var i = 1
while (i <= n) {
acc = i * acc
i += 1
}
acc
}
def calculateByTailRecursion(n: Int): Out = {
require(n>0, "n must be positive")
#annotation.tailrec
def fac(n: Int, acc: Out): Out = if (n==1) acc else fac(n-1, n*acc)
fac(n, 1)
}
def calculateByTailRecursionUpward(n: Int): Out = {
require(n>0, "n must be positive")
#annotation.tailrec
def fac(i: Int, acc: Out): Out = if (i == n) n*acc else fac(i+1, i*acc)
fac(1, 1)
}
def attempt(f: ()=>Unit): Boolean = {
try {
f()
true
} catch {
case _: Throwable =>
println(" <<<<< Failed...")
false
}
}
def comparePerformance(n: Int) {
def showOutput[A](msg: String, data: (Long, A), showOutput:Boolean = true) =
showOutput match {
case true =>
val res = data._2.toString
val pref = res.substring(0,5)
val midd = res.substring((res.length-5)/ 2, (res.length-5)/ 2 + 10)
val suff = res.substring(res.length-5)
printf("%s returned %s in %d ms\n", msg, s"$pref...$midd...$suff" , data._1)
case false =>
printf("%s in %d ms\n", msg, data._1)
}
def measure[A](f:()=>A): (Long, A) = {
val start = System.currentTimeMillis
val o = f()
(System.currentTimeMillis - start, o)
}
attempt(() => showOutput ("By for loop", measure(()=>calculateByForLoop(n))))
attempt(() => showOutput ("By while loop", measure(()=>calculateByWhileLoop(n))))
attempt(() => showOutput ("By non-tail recursion", measure(()=>calculateByRecursion(n))))
attempt(() => showOutput ("By tail recursion", measure(()=>calculateByTailRecursion(n))))
attempt(() => showOutput ("By tail recursion upward", measure(()=>calculateByTailRecursionUpward(n))))
}
}
My results:
scala> Factorial.comparePerformance(20000)
By for loop returned 18192...5708616582...00000 in 179 ms
By while loop returned 18192...5708616582...00000 in 159 ms
By non-tail recursion <<<<< Failed...
By tail recursion returned 18192...5708616582...00000 in 169 ms
By tail recursion upward returned 18192...5708616582...00000 in 174 ms
By for loop returned 18192...5708616582...00000 in 212 ms
By while loop returned 18192...5708616582...00000 in 156 ms
By non-tail recursion returned 18192...5708616582...00000 in 155 ms
By tail recursion returned 18192...5708616582...00000 in 166 ms
By tail recursion upward returned 18192...5708616582...00000 in 137 ms
scala> Factorial.comparePerformance(200000)
By for loop returned 14202...0169293868...00000 in 17467 ms
By while loop returned 14202...0169293868...00000 in 17303 ms
By non-tail recursion <<<<< Failed...
By tail recursion returned 14202...0169293868...00000 in 18477 ms
By tail recursion upward returned 14202...0169293868...00000 in 17188 ms

Related

How to optimize a function and minimize allocations

The following function generates primes up to N. For large N, this becomes quite slow, my Julia implementation is 5X faster for N = 10**7. I guess the creation of a large integer array and using pack to collect the result is the slowest part. I tried counting .true.s first, then allocating res(:) and populating it using a loop, but the speedup was negligible (4%) as I iterate the prims array twice in this case. In Julia, I used findall which does exactly what I did; iterating the array twice, first counting trues and allocationg result then populating it. Any ideas? Thank you.
Compiler:
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.3.210 Build 20180410 (on Windows 10)
Options: ifort -warn /O3 -heap-arrays:8000000
program main
implicit none
integer, allocatable :: primes(:)
integer :: t0, t1, count_rate, count_max
call system_clock(t0, count_rate, count_max)
primes = do_primes(10**7)
call system_clock(t1)
print '(a,f7.5,a)', 'Elapsed time: ', real(t1-t0)/count_rate, ' seconds'
print *, primes(1:10)
contains
function do_primes(N) result (res)
integer, allocatable :: res(:), array(:)
logical, allocatable :: prims(:)
integer :: N, i, j
allocate (prims(N))
prims = .true.
i = 3
do while (i * i < N)
j = i
do while (j * i < N)
prims(j*i) = .false.
j = j + 2
end do
i = i + 2
end do
prims(1) = .false.
prims(2) = .true.
do i = 4, N, 2
prims(i) = .false.
end do
allocate (array(N))
do i = 1, N
array(i) = i
end do
res = pack(array, prims)
end
end
Timing (147 runs):
Elapsed time: 0.14723 seconds
Edit:
I converted the do whiles to straight dos as per #IanBush comment like this, still no speedup:
do i = 3, sqrt(dble(N)), 2
do j = i, N/i, 2
prims(j*i) = .false.
end do
end do
The Julia implementation:
function do_primes(N)
prims = trues(N)
i = 3
while i * i < N
j = i
while j * i < N
prims[j*i] = false
j = j + 2
end
i = i + 2
end
prims[1] = false
prims[2] = true
prims[4:2:N] .= false
return findall(prims)
end
Timing:
using Benchmarktools
#benchmark do_primes(10^7)
BenchmarkTools.Trial:
memory estimate: 6.26 MiB
allocs estimate: 5
--------------
minimum time: 32.227 ms (0.00% GC)
median time: 32.793 ms (0.00% GC)
mean time: 34.098 ms (3.92% GC)
maximum time: 94.479 ms (65.46% GC)
--------------
samples: 147
evals/sample: 1

Exponent calculation speed

I am currently testing Julia (I've worked with Matlab)
In matlab the calculation speed of N^3 is slower than NxNxN. This doesn't happen with N^2 and NxN. They use a different algorithm to calculate higher-order exponents because they prefer accuracy rather than speed.
I think Julia do the same thing.
I wanted to ask if there is a way to force Julia to calculate the exponent of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago a I did a few test on matlab of this. I made a translation of that code to julia.
Links to code:
http://pastebin.com/bbeukhTc
(I cant upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x times of the smallest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x times of the smallests)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.46:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x times of the smallest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5 times of the smallest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests julia is faster than Matlab, but i am using a relative old version. I cant test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm
Checking llvm's source code:
define double #powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
%result = call double #llvm.powi.f64(double %F, i32 %power)
ret double %result
}
Now, the __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
const int recip = b < 0;
double r = 1;
while (1)
{
if (b & 1)
r *= a;
b /= 2;
if (b == 0)
break;
a *= a;
}
return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence multiplications. That's why N^3 is slower than N^2.
jl_powi_llvm (called in fastmath.jl. "jl_" is concatenated by macro expansion), on the other hand, casts the exponent to floating-point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
jl_value_t *ty = jl_typeof(a);
if (!jl_is_bitstype(ty))
jl_error("powi_llvm: a is not a bitstype");
if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
jl_error("powi_llvm: b is not a 32-bit bitstype");
jl_value_t *newv = newstruct((jl_datatype_t*)ty);
void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
int sz = jl_datatype_size(ty);
switch (sz) {
/* choose the right size c-type operation */
case 4:
*(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
break;
case 8:
*(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
break;
default:
jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
}
return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: Yes, there is a way to force usage of multiplication, at cost of accuracy. It's the #fastmath macro:
julia> #benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> #benchmark #fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with #fastmath, performance is much better.

Is my Theano program actually using the GPU?

Theano claims it's using the GPU; it says what device when it starts up, etc. Furthermore nvidia-smi says it's being used.
But the running time seems to be exactly the same regardless of whether or not I use it.
Could it have something to do with integer arithmetic?
import sys
import numpy as np
import theano
import theano.tensor as T
def ariths(v, ub):
"""Given a sorted vector v and scalar ub, returns multiples of elements in v.
Specifically, returns a vector containing all numbers j * k < ub where j is in
v and k >= j. Some elements may occur more than once in the output.
"""
lp = v[0]
v = T.shape_padright(v)
a = T.shape_padleft(T.arange(0, (ub + lp - 1) // lp - lp, 1, 'int64'))
res = v * (a + v)
return res[(res < ub).nonzero()]
def filter_composites(pv, using_primes):
a = ariths(using_primes, pv.size)
return T.set_subtensor(pv[a], 0)
def _iterfn(prev_bnds, pv):
bstart = prev_bnds[0]
bend = prev_bnds[1]
use_primes = pv[bstart:bend].nonzero()[0] + bstart
pv = filter_composites(pv, use_primes)
return pv
def primes_to(n):
if n <= 2:
return np.asarray([])
elif n <= 3:
return np.asarray([2])
res = T.ones(n, 'int8')
res = T.set_subtensor(res[:2], 0)
ubs = [[2, 4]]
ub = 4
while ub ** 2 < n:
prevub = ub
ub *= 2
ubs.append([prevub, ub])
(r, u5) = theano.scan(fn=_iterfn,
outputs_info=res, sequences=[np.asarray(ubs)])
return r[-1].nonzero()[0]
def main(n):
print(primes_to(n).size.eval())
if __name__ == '__main__':
main(int(sys.argv[1]))
The answer is yes. And no. If you profile your code in a GPU enabled Theano installation using nvprof, you will see something like this:
==16540== Profiling application: python ./theano_test.py
==16540== Profiling result:
Time(%) Time Calls Avg Min Max Name
49.22% 12.096us 1 12.096us 12.096us 12.096us kernel_reduce_ccontig_node_c8d7bd33dfef61705c2854dd1f0cb7ce_0(unsigned int, float const *, float*)
30.60% 7.5200us 3 2.5060us 832ns 5.7600us [CUDA memcpy HtoD]
13.93% 3.4240us 1 3.4240us 3.4240us 3.4240us [CUDA memset]
6.25% 1.5350us 1 1.5350us 1.5350us 1.5350us [CUDA memcpy DtoH]
i.e. There is a least a reduce operation being performed on your GPU. However, if you modify your main like this:
def main():
n = 100000000
print(primes_to(n).size.eval())
if __name__ == '__main__':
import cProfile, pstats
cProfile.run("main()", "{}.profile".format(__file__))
s = pstats.Stats("{}.profile".format(__file__))
s.strip_dirs()
s.sort_stats("time").print_stats(10)
and use cProfile to profile your code, you will see something like this:
Thu Mar 10 14:35:24 2016 ./theano_test.py.profile
486743 function calls (480590 primitive calls) in 17.444 seconds
Ordered by: internal time
List reduced from 1138 to 10 due to restriction <10>
ncalls tottime percall cumtime percall filename:lineno(function)
1 6.376 6.376 16.655 16.655 {theano.scan_module.scan_perform.perform}
13 6.168 0.474 6.168 0.474 subtensor.py:2084(perform)
27 2.910 0.108 2.910 0.108 {method 'nonzero' of 'numpy.ndarray' objects}
30 0.852 0.028 0.852 0.028 {numpy.core.multiarray.concatenate}
27 0.711 0.026 0.711 0.026 {method 'astype' of 'numpy.ndarray' objects}
13 0.072 0.006 0.072 0.006 {numpy.core.multiarray.arange}
1 0.034 0.034 17.142 17.142 function_module.py:482(__call__)
387 0.020 0.000 0.052 0.000 graph.py:486(stack_search)
77 0.016 0.000 10.731 0.139 op.py:767(rval)
316 0.013 0.000 0.066 0.000 graph.py:715(general_toposort)
The slowest operation (just) is the scan call, and looking at the source for scan, you can see that presently, GPU execution of scan is disabled.
So then answer is, yes, the GPU is being used for something in your code, but no, the most time consuming operation(s) are being run on the CPU because GPU execution appears to be hard disabled in the code at present.

Memoization done, what now?

I was trying to solve a puzzle in Haskell and had written the following code:
u 0 p = 0.0
u 1 p = 1.0
u n p = 1.0 + minimum [((1.0-q)*(s k p)) + (u (n-k) p) | k <-[1..n], let q = (1.0-p)**(fromIntegral k)]
s 1 p = 0.0
s n p = 1.0 + minimum [((1.0-q)*(s (n-k) p)) + q*((s k p) + (u (n-k) p)) | k <-[1..(n-1)], let q = (1.0-(1.0-p)**(fromIntegral k))/(1.0-(1.0-p)**(fromIntegral n))]
This code was terribly slow though. I suspect the reason for this is that the same things get calculated again and again. I therefore made a memoized version:
memoUa = array (0,10000) ((0,0.0):(1,1.0):[(k,mua k) | k<- [2..10000]])
mua n = (1.0) + minimum [((1.0-q)*(memoSa ! k)) + (memoUa ! (n-k)) | k <-[1..n], let q = (1.0-0.02)**(fromIntegral k)]
memoSa = array (0,10000) ((0,0.0):(1,0.0):[(k,msa k) | k<- [2..10000]])
msa n = (1.0) + minimum [((1.0-q) * (memoSa ! (n-k))) + q*((memoSa ! k) + (memoUa ! (n-k))) | k <-[1..(n-1)], let q = (1.0-(1.0-0.02)**(fromIntegral k))/(1.0-(1.0-0.02)**(fromIntegral n))]
This seems to be a lot faster, but now I get an out of memory error. I do not understand why this happens (the same strategy in java, without recursion, has no problems). Could somebody point me in the right direction on how to improve this code?
EDIT: I am adding my java version here (as I don't know where else to put it). I realize that the code isn't really reader-friendly (no meaningful names, etc.), but I hope it is clear enough.
public class Main {
public static double calc(double p) {
double[] u = new double[10001];
double[] s = new double[10001];
u[0] = 0.0;
u[1] = 1.0;
s[0] = 0.0;
s[1] = 0.0;
for (int n=2;n<10001;n++) {
double q = 1.0;
double denom = 1.0;
for (int k = 1; k <= n; k++ ) {
denom = denom * (1.0 - p);
}
denom = 1.0 - denom;
s[n] = (double) n;
u[n] = (double) n;
for (int k = 1; k <= n; k++ ) {
q = (1.0 - p) * q;
if (k<n) {
double qs = (1.0-q)/denom;
double bs = (1.0-qs)*s[n-k] + qs*(s[k]+ u[n-k]) + 1.0;
if (bs < s[n]) {
s[n] = bs;
}
}
double bu = (1.0-q)*s[k] + 1.0 + u[n-k];
if (bu < u[n]) {
u[n] = bu;
}
}
}
return u[10000];
}
public static void main(String[] args) {
double s = 0.0;
int i = 2;
//for (int i = 1; i<51; i++) {
s = s + calc(i*0.01);
//}
System.out.println("result = " + s);
}
}
I don't run out of memory when I run the compiled version, but there is a significant difference between how the Java version works and how the Haskell version works which I'll illustrate here.
The first thing to do is to add some important type signatures. In particular, you don't want Integer array indices, so I added:
memoUa :: Array Int Double
memoSa :: Array Int Double
I found these using ghc-mod check. I also added a main so that you can run it from the command line:
import System.Environment
main = do
(arg:_) <- getArgs
let n = read arg
print $ mua n
Now to gain some insight into what's going on, we can compile the program using profiling:
ghc -O2 -prof memo.hs
Then when we invoke the program like this:
memo 1000 +RTS -s
we will get profiling output which looks like:
164.31333233347755
98,286,872 bytes allocated in the heap
29,455,360 bytes copied during GC
657,080 bytes maximum residency (29 sample(s))
38,260 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 0.03s 0.03s 0.0002s 0.0011s
Gen 1 29 colls, 0 par 0.03s 0.03s 0.0011s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.21s ( 0.21s elapsed)
GC time 0.06s ( 0.06s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.27s elapsed)
%GC time 21.8% (22.3% elapsed)
Alloc rate 468,514,624 bytes per MUT second
Productivity 78.2% of total user, 77.3% of total elapsed
Important things to pay attention to are:
maximum residency
Total time
%GC time (or Productivity)
Maximum residency is a measure of how much memory is needed by the program. %GC time the proportion of the time spent in garbage collection and Productivity is the complement (100% - %GC time).
If you run the program for various input values you will see a productivity of around 80%:
n Max Res. Prod. Time Output
2000 779,076 79.4% 1.10s 328.54535361588535
4000 1,023,016 80.7% 4.41s 657.0894961398351
6000 1,299,880 81.3% 9.91s 985.6071032981068
8000 1,539,352 81.5% 17.64s 1314.0968411684714
10000 1,815,600 81.7% 27.57s 1642.5891214360522
This means that about 20% of the run time is spent in garbage collection. Also, we see increasing memory usage as n increases.
It turns out we can dramatically improve productivity and memory usage by telling Haskell the order in which to evaluate the array elements instead of relying on lazy evaluation:
import Control.Monad (forM_)
main = do
(arg:_) <- getArgs
let n = read arg
forM_ [1..n] $ \i -> mua i `seq` return ()
print $ mua n
And the new profiling stats are:
n Max Res. Prod. Time Output
2000 482,800 99.3% 1.31s 328.54535361588535
4000 482,800 99.6% 5.88s 657.0894961398351
6000 482,800 99.5% 12.09s 985.6071032981068
8000 482,800 98.1% 21.71s 1314.0968411684714
10000 482,800 96.1% 34.58s 1642.5891214360522
Some interesting observations here: productivity is up, memory usage is down (constant now over the range of inputs) but run time is up. This suggests that we forced more computations than we needed to. In an imperative language like Java you have to give an evaluation order so you would know exactly which computations need to be performed. It would interesting to see your Java code to see which computations it is performing.

How do I optimize a loop which can be fully strict

I'm trying to write a brute-force solution to Project Euler Problem #145, and I cannot get my solution to run in less than about 1 minute 30 secs.
(I'm aware there are various short-cuts and even paper-and-pencil solutions; for the purpose of this question I'm not considering those).
In the best version I've come up with so far, profiling shows that the majority of the time is spent in foldDigits. This function need not be lazy at all, and to my mind ought to be optimized to a simple loop. As you can see I've attempted to make various bits of the program strict.
So my question is: without changing the overall algorithm, is there some way to bring the execution time of this program down to the sub-minute mark?
(Or if not, is there a way to see that the code of foldDigits is as optimized as possible?)
-- ghc -O3 -threaded Euler-145.hs && Euler-145.exe +RTS -N4
{-# LANGUAGE BangPatterns #-}
import Control.Parallel.Strategies
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f !acc !n
| n < 10 = i
| otherwise = foldDigits f i d
where (d, m) = n `quotRem` 10
!i = f acc m
reverseNumber :: Int -> Int
reverseNumber !n
= foldDigits accumulate 0 n
where accumulate !v !d = v * 10 + d
allDigitsOdd :: Int -> Bool
allDigitsOdd n
= foldDigits andOdd True n
where andOdd !a d = a && isOdd d
isOdd !x = x `rem` 2 /= 0
isReversible :: Int -> Bool
isReversible n
= notDivisibleByTen n && allDigitsOdd (n + rn)
where rn = reverseNumber n
notDivisibleByTen !x = x `rem` 10 /= 0
countRange acc start end
| start > end = acc
| otherwise = countRange (acc + v) (start + 1) end
where v = if isReversible start then 1 else 0
main
= print $ sum $ parMap rseq cr ranges
where max = 1000000000
qmax = max `div` 4
ranges = [(1, qmax), (qmax, qmax * 2), (qmax * 2, qmax * 3), (qmax * 3, max)]
cr (s, e) = countRange 0 s e
As it stands, the core that ghc-7.6.1 produces for foldDigits (with -O2) is
Rec {
$wfoldDigits_r2cK
:: forall a_aha.
(a_aha -> GHC.Types.Int -> a_aha)
-> a_aha -> GHC.Prim.Int# -> a_aha
[GblId, Arity=3, Caf=NoCafRefs, Str=DmdType C(C(S))SL]
$wfoldDigits_r2cK =
\ (# a_aha)
(w_s284 :: a_aha -> GHC.Types.Int -> a_aha)
(w1_s285 :: a_aha)
(ww_s288 :: GHC.Prim.Int#) ->
case w1_s285 of acc_Xhi { __DEFAULT ->
let {
ds_sNo [Dmd=Just D(D(T)S)] :: (GHC.Types.Int, GHC.Types.Int)
[LclId, Str=DmdType]
ds_sNo =
case GHC.Prim.quotRemInt# ww_s288 10
of _ { (# ipv_aJA, ipv1_aJB #) ->
(GHC.Types.I# ipv_aJA, GHC.Types.I# ipv1_aJB)
} } in
case w_s284 acc_Xhi (case ds_sNo of _ { (d_arS, m_Xsi) -> m_Xsi })
of i_ahg { __DEFAULT ->
case GHC.Prim.<# ww_s288 10 of _ {
GHC.Types.False ->
case ds_sNo of _ { (d_Xsi, m_Xs5) ->
case d_Xsi of _ { GHC.Types.I# ww1_X28L ->
$wfoldDigits_r2cK # a_aha w_s284 i_ahg ww1_X28L
}
};
GHC.Types.True -> i_ahg
}
}
}
end Rec }
which, as you can see, re-boxes the result of the quotRem call. The problem is that no property of f is available here, and as a recursive function, foldDigits cannot be inlined.
With a manual worker-wrapper transform making the function argument static,
foldDigits :: (a -> Int -> a) -> a -> Int -> a
foldDigits f = go
where
go !acc 0 = acc
go acc n = case n `quotRem` 10 of
(q,r) -> go (f acc r) q
foldDigits becomes inlinable, and you get specialised versions for your uses operating on unboxed data, but no top-level foldDigits, e.g.
Rec {
$wgo_r2di :: GHC.Prim.Int# -> GHC.Prim.Int# -> GHC.Prim.Int#
[GblId, Arity=2, Caf=NoCafRefs, Str=DmdType LL]
$wgo_r2di =
\ (ww_s28F :: GHC.Prim.Int#) (ww1_s28J :: GHC.Prim.Int#) ->
case ww1_s28J of ds_XJh {
__DEFAULT ->
case GHC.Prim.quotRemInt# ds_XJh 10
of _ { (# ipv_aJK, ipv1_aJL #) ->
$wgo_r2di (GHC.Prim.+# (GHC.Prim.*# ww_s28F 10) ipv1_aJL) ipv_aJK
};
0 -> ww_s28F
}
end Rec }
and the effect on computation time is tangible, for the original, I got
$ ./eul145 +RTS -s -N2
608720
1,814,289,579,592 bytes allocated in the heap
196,407,088 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1827331 colls, 1827331 par 23.77s 11.86s 0.0000s 0.0041s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 54.94% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 620.52s (313.51s elapsed)
GC time 23.77s ( 11.86s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 644.29s (325.37s elapsed)
Alloc rate 2,923,834,808 bytes per MUT second
(I used -N2 since my i5 only has two physical cores), vs.
$ ./eul145 +RTS -s -N2
608720
16,000,063,624 bytes allocated in the heap
403,384 bytes copied during GC
47,184 bytes maximum residency (2 sample(s))
30,640 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 15852 colls, 15852 par 0.34s 0.17s 0.0000s 0.0037s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0001s 0.0001s
Parallel GC work balance: 43.86% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 4 (3 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 314.85s (160.08s elapsed)
GC time 0.34s ( 0.17s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 315.20s (160.25s elapsed)
Alloc rate 50,817,657 bytes per MUT second
Productivity 99.9% of total user, 196.5% of total elapsed
with the modification. The running time roughly halved, and the allocations reduced 100-fold.

Resources