Multiple execution times in python Merge Sort Function - time

I wrote an algorithm for Merge Sort that sorts correctly but I'm sure how to get a single execution time using time.time() - Any advice would be very helpful!
def merge_sort(data):
start_time = time.time()
if len(data) <= 1:
return
mid = len(data) // 2
left_data = data[:mid]
right_data = data[mid:]
merge_sort(left_data)
merge_sort(right_data)
left_index = 0
right_index = 0
data_index = 0
while left_index < len(left_data) and right_index < len(right_data):
if left_data[left_index] < right_data[right_index]:
data[data_index] = left_data[left_index]
left_index += 1
else:
data[data_index] = right_data[right_index]
right_index += 1
data_index += 1
if left_index < len(left_data):
del data[data_index:]
data += left_data[left_index:]
elif right_index < len(right_data):
del data[data_index:]
data += right_data[right_index:]
print('Merge Sort execution time: {}'.format(start_time - time.time()))
return data
OUTPUT:
Merge Sort execution time: 0.0
Merge Sort execution time: 0.0
Merge Sort execution time: -0.015623092651367188
Merge Sort execution time: 0.0
Merge Sort execution time: 0.0
Merge Sort execution time: 0.0
Merge Sort execution time: 0.0
Merge Sort execution time: -0.046871185302734375
Merge Sort execution time: 0.0
Merge Sort execution time: 0.0
Merge Sort execution time: 0.0
Merge Sort execution time: -0.015625476837158203
Merge Sort execution time: 0.0
Merge Sort execution time: 0.0
Merge Sort execution time: 0.0
Merge Sort execution time: -0.015624284744262695
Merge Sort execution time: -0.0312497615814209
Merge Sort execution time: -0.07812094688415527

Related

How to optimize a function and minimize allocations

The following function generates primes up to N. For large N, this becomes quite slow, my Julia implementation is 5X faster for N = 10**7. I guess the creation of a large integer array and using pack to collect the result is the slowest part. I tried counting .true.s first, then allocating res(:) and populating it using a loop, but the speedup was negligible (4%) as I iterate the prims array twice in this case. In Julia, I used findall which does exactly what I did; iterating the array twice, first counting trues and allocationg result then populating it. Any ideas? Thank you.
Compiler:
Intel(R) Visual Fortran Intel(R) 64 Compiler for applications running on Intel(R) 64, Version 18.0.3.210 Build 20180410 (on Windows 10)
Options: ifort -warn /O3 -heap-arrays:8000000
program main
implicit none
integer, allocatable :: primes(:)
integer :: t0, t1, count_rate, count_max
call system_clock(t0, count_rate, count_max)
primes = do_primes(10**7)
call system_clock(t1)
print '(a,f7.5,a)', 'Elapsed time: ', real(t1-t0)/count_rate, ' seconds'
print *, primes(1:10)
contains
function do_primes(N) result (res)
integer, allocatable :: res(:), array(:)
logical, allocatable :: prims(:)
integer :: N, i, j
allocate (prims(N))
prims = .true.
i = 3
do while (i * i < N)
j = i
do while (j * i < N)
prims(j*i) = .false.
j = j + 2
end do
i = i + 2
end do
prims(1) = .false.
prims(2) = .true.
do i = 4, N, 2
prims(i) = .false.
end do
allocate (array(N))
do i = 1, N
array(i) = i
end do
res = pack(array, prims)
end
end
Timing (147 runs):
Elapsed time: 0.14723 seconds
Edit:
I converted the do whiles to straight dos as per #IanBush comment like this, still no speedup:
do i = 3, sqrt(dble(N)), 2
do j = i, N/i, 2
prims(j*i) = .false.
end do
end do
The Julia implementation:
function do_primes(N)
prims = trues(N)
i = 3
while i * i < N
j = i
while j * i < N
prims[j*i] = false
j = j + 2
end
i = i + 2
end
prims[1] = false
prims[2] = true
prims[4:2:N] .= false
return findall(prims)
end
Timing:
using Benchmarktools
#benchmark do_primes(10^7)
BenchmarkTools.Trial:
memory estimate: 6.26 MiB
allocs estimate: 5
--------------
minimum time: 32.227 ms (0.00% GC)
median time: 32.793 ms (0.00% GC)
mean time: 34.098 ms (3.92% GC)
maximum time: 94.479 ms (65.46% GC)
--------------
samples: 147
evals/sample: 1

Why rayon-based parallel processing takes more time than serial processing?

Learning Rayon, I wanted to compare the performace of parallel calculation and serial calculation of Fibonacci series. Here's my code:
use rayon;
use std::time::Instant;
fn main() {
let nth = 30;
let now = Instant::now();
let fib = fibonacci_serial(nth);
println!(
"[s] The {}th number in the fibonacci sequence is {}, elapsed: {}",
nth,
fib,
now.elapsed().as_micros()
);
let now = Instant::now();
let fib = fibonacci_parallel(nth);
println!(
"[p] The {}th number in the fibonacci sequence is {}, elapsed: {}",
nth,
fib,
now.elapsed().as_micros()
);
}
fn fibonacci_parallel(n: u64) -> u64 {
if n <= 1 {
return n;
}
let (a, b) = rayon::join(|| fibonacci_parallel(n - 2), || fibonacci_parallel(n - 1));
a + b
}
fn fibonacci_serial(n: u64) -> u64 {
if n <= 1 {
return n;
}
fibonacci_serial(n - 2) + fibonacci_serial(n - 1)
}
Run in Rust Playground
I expected the elapsed time of parallel calculation would be smaller than the elapsed time of serial caculation, but the result was opposite:
# `s` stands for serial calculation and `p` for parallel
[s] The 30th number in the fibonacci sequence is 832040, elapsed: 12127
[p] The 30th number in the fibonacci sequence is 832040, elapsed: 990379
My implementation for serial/parallel calculation would have flaws. But if not, why am I seeing these results?
I think the real reason is, that you create n² threads which is not good. In every call of fibonacci_parallel you create another pair of threads for rayon and because you call fibonacci_parallel again in the closure you create yet another pair of threads.
This is utterly terrible for the OS/rayon.
An approach to solve this problem could be this:
fn fibonacci_parallel(n: u64) -> u64 {
fn inner(n: u64) -> u64 {
if n <= 1 {
return n;
}
inner(n - 2) + inner(n - 1)
}
if n <= 1 {
return n;
}
let (a, b) = rayon::join(|| inner(n - 2), || inner(n - 1));
a + b
}
You create two threads which both execute the inner function. With this addition I get
op#VBOX /t/t/foo> cargo run --release 40
Finished release [optimized] target(s) in 0.03s
Running `target/release/foo 40`
[s] The 40th number in the fibonacci sequence is 102334155, elapsed: 1373741
[p] The 40th number in the fibonacci sequence is 102334155, elapsed: 847343
But as said, for low numbers parallel execution is not worth it:
op#VBOX /t/t/foo> cargo run --release 20
Finished release [optimized] target(s) in 0.02s
Running `target/release/foo 20`
[s] The 10th number in the fibonacci sequence is 6765, elapsed: 82
[p] The 10th number in the fibonacci sequence is 6765, elapsed: 241

Exponent calculation speed

I am currently testing Julia (I've worked with Matlab)
In matlab the calculation speed of N^3 is slower than NxNxN. This doesn't happen with N^2 and NxN. They use a different algorithm to calculate higher-order exponents because they prefer accuracy rather than speed.
I think Julia do the same thing.
I wanted to ask if there is a way to force Julia to calculate the exponent of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago a I did a few test on matlab of this. I made a translation of that code to julia.
Links to code:
http://pastebin.com/bbeukhTc
(I cant upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x times of the smallest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x times of the smallests)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.46:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x times of the smallest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5 times of the smallest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests julia is faster than Matlab, but i am using a relative old version. I cant test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm
Checking llvm's source code:
define double #powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
%result = call double #llvm.powi.f64(double %F, i32 %power)
ret double %result
}
Now, the __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
const int recip = b < 0;
double r = 1;
while (1)
{
if (b & 1)
r *= a;
b /= 2;
if (b == 0)
break;
a *= a;
}
return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence multiplications. That's why N^3 is slower than N^2.
jl_powi_llvm (called in fastmath.jl. "jl_" is concatenated by macro expansion), on the other hand, casts the exponent to floating-point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
jl_value_t *ty = jl_typeof(a);
if (!jl_is_bitstype(ty))
jl_error("powi_llvm: a is not a bitstype");
if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
jl_error("powi_llvm: b is not a 32-bit bitstype");
jl_value_t *newv = newstruct((jl_datatype_t*)ty);
void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
int sz = jl_datatype_size(ty);
switch (sz) {
/* choose the right size c-type operation */
case 4:
*(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
break;
case 8:
*(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
break;
default:
jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
}
return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: Yes, there is a way to force usage of multiplication, at cost of accuracy. It's the #fastmath macro:
julia> #benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> #benchmark #fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with #fastmath, performance is much better.

Memory allocation in a fixed point algorithm

I need to find the fixed point of a function f. The algorithm is very simple:
Given X, compute f(X)
If ||X-f(X)|| is lower than a certain tolerance, exit and return X,
otherwise set X equal to f(X) and go back to 1.
I'd like to be sure I'm not allocating memory for a new object at every iteration
For now, the algorithm looks like this:
iter1 = function(x::Vector{Float64})
for iter in 1:max_it
oldx = copy(x)
g1(x)
delta = vnormdiff(x, oldx, 2)
if delta < tolerance
break
end
end
end
Here g1(x) is a function that sets x to f(x)
But it seems this loop allocates a new vector at every loop (see below).
Another way to write the algorithm is the following:
iter2 = function(x::Vector{Float64})
oldx = similar(x)
for iter in 1:max_it
(oldx, x) = (x, oldx)
g2(x, oldx)
delta = vnormdiff(oldx, x, 2)
if delta < tolerance
break
end
end
end
where g2(x1, x2) is a function that sets x1 to f(x2).
Is thi the most efficient and natural way to write this kind of iteration problem?
Edit1: timing shows that the second code is faster:
using NumericExtensions
max_it = 1000
tolerance = 1e-8
max_it = 100
g1 = function(x::Vector{Float64})
for i in 1:length(x)
x[i] = x[i]/2
end
end
g2 = function(newx::Vector{Float64}, x::Vector{Float64})
for i in 1:length(x)
newx[i] = x[i]/2
end
end
x = fill(1e7, int(1e7))
#time iter1(x)
# elapsed time: 4.688103075 seconds (4960117840 bytes allocated, 29.72% gc time)
x = fill(1e7, int(1e7))
#time iter2(x)
# elapsed time: 2.187916177 seconds (80199676 bytes allocated, 0.74% gc time)
Edit2: using copy!
iter3 = function(x::Vector{Float64})
oldx = similar(x)
for iter in 1:max_it
copy!(oldx, x)
g1(x)
delta = vnormdiff(x, oldx, 2)
if delta < tolerance
break
end
end
end
x = fill(1e7, int(1e7))
#time iter3(x)
# elapsed time: 2.745350176 seconds (80008088 bytes allocated, 1.11% gc time)
I think replacing the following lines in the first code
for iter = 1:max_it
oldx = copy( x )
...
by
oldx = zeros( N )
for iter = 1:max_it
oldx[:] = x # or copy!( oldx, x )
...
will be more efficient because no array is allocated. Also, the code can be made more efficient by writing for-loops explicitly. This can be seen, for example, from the following comparison
function test()
N = 1000000
a = zeros( N )
b = zeros( N )
#time c = copy( a )
#time b[:] = a
#time copy!( b, a )
#time \
for i = 1:length(a)
b[i] = a[i]
end
#time \
for i in eachindex(a)
b[i] = a[i]
end
end
test()
The result obtained with Julia0.4.0 on Linux(x86_64) is
elapsed time: 0.003955609 seconds (7 MB allocated)
elapsed time: 0.001279142 seconds (0 bytes allocated)
elapsed time: 0.000836167 seconds (0 bytes allocated)
elapsed time: 1.19e-7 seconds (0 bytes allocated)
elapsed time: 1.28e-7 seconds (0 bytes allocated)
It seems that copy!() is faster than using [:] in the left-hand side,
though the difference becomes marginal in repeated calculations (there seems to be
some overhead for the first [:] calculation). Btw, the last example using eachindex() is very convenient for looping over multi-dimensional arrays.
Similar comparison can be made for vnormdiff(), where use of norm( x - oldx ) etc is slower than an explicit loop for vector norm, because the former allocates one temporary array for x - oldx.

Scala recursion vs loop: performance and runtime considerations

I've wrote a naïve test-bed to measure the performance of three kinds of factorial implementation: loop based, non tail-recursive and tail-recursive.
Surprisingly to me the worst performant was the loop ones («while» was expected to be more efficient so I provided both) that cost
almost twice than the tail recursive alternative.
*ANSWER: fixing the loop implementation avoiding the = operator which outperform worst with BigInt due to its internals «loops» became fastest as expected
Another «woodoo» behavior I've experienced was the StackOverflow
exception which wasn't thrown systematically for the same input in the
case of non-tail recursive implementation. I can circumvent the
StackOverlow by progressively call the function with larger and larger
values… I feel crazy :) Answer: JVM require to converge during startup, then behavior is coherent and systematic
This is the code:
final object Factorial {
type Out = BigInt
def calculateByRecursion(n: Int): Out = {
require(n>0, "n must be positive")
n match {
case _ if n == 1 => return 1
case _ => return n * calculateByRecursion(n-1)
}
}
def calculateByForLoop(n: Int): Out = {
require(n>0, "n must be positive")
var accumulator: Out = 1
for (i <- 1 to n)
accumulator = i * accumulator
accumulator
}
def calculateByWhileLoop(n: Int): Out = {
require(n>0, "n must be positive")
var accumulator: Out = 1
var i = 1
while (i <= n) {
accumulator = i * accumulator
i += 1
}
accumulator
}
def calculateByTailRecursion(n: Int): Out = {
require(n>0, "n must be positive")
#tailrec def fac(n: Int, acc: Out): Out = n match {
case _ if n == 1 => acc
case _ => fac(n-1, n * acc)
}
fac(n, 1)
}
def calculateByTailRecursionUpward(n: Int): Out = {
require(n>0, "n must be positive")
#tailrec def fac(i: Int, acc: Out): Out = n match {
case _ if i == n => n * acc
case _ => fac(i+1, i * acc)
}
fac(1, 1)
}
def comparePerformance(n: Int) {
def showOutput[A](msg: String, data: (Long, A), showOutput:Boolean = false) =
showOutput match {
case true => printf("%s returned %s in %d ms\n", msg, data._2.toString, data._1)
case false => printf("%s in %d ms\n", msg, data._1)
}
def measure[A](f:()=>A): (Long, A) = {
val start = System.currentTimeMillis
val o = f()
(System.currentTimeMillis - start, o)
}
showOutput ("By for loop", measure(()=>calculateByForLoop(n)))
showOutput ("By while loop", measure(()=>calculateByWhileLoop(n)))
showOutput ("By non-tail recursion", measure(()=>calculateByRecursion(n)))
showOutput ("By tail recursion", measure(()=>calculateByTailRecursion(n)))
showOutput ("By tail recursion upward", measure(()=>calculateByTailRecursionUpward(n)))
}
}
What follows is some output from sbt console (Before «while» implementation):
scala> example.Factorial.comparePerformance(10000)
By loop in 3 ns
By non-tail recursion in >>>>> StackOverflow!!!!!… see later!!!
........
scala> example.Factorial.comparePerformance(1000)
By loop in 3 ms
By non-tail recursion in 1 ms
By tail recursion in 4 ms
scala> example.Factorial.comparePerformance(5000)
By loop in 105 ms
By non-tail recursion in 27 ms
By tail recursion in 34 ms
scala> example.Factorial.comparePerformance(10000)
By loop in 236 ms
By non-tail recursion in 106 ms >>>> Now works!!!
By tail recursion in 127 ms
scala> example.Factorial.comparePerformance(20000)
By loop in 977 ms
By non-tail recursion in 495 ms
By tail recursion in 564 ms
scala> example.Factorial.comparePerformance(30000)
By loop in 2285 ms
By non-tail recursion in 1183 ms
By tail recursion in 1281 ms
What follows is some output from sbt console (After «while» implementation):
scala> example.Factorial.comparePerformance(10000)
By for loop in 252 ms
By while loop in 246 ms
By non-tail recursion in 130 ms
By tail recursion in 136 ns
scala> example.Factorial.comparePerformance(20000)
By for loop in 984 ms
By while loop in 1091 ms
By non-tail recursion in 508 ms
By tail recursion in 560 ms
What follows is some output from sbt console (after «upward» tail recursion implementation) the world come back sane:
scala> example.Factorial.comparePerformance(10000)
By for loop in 259 ms
By while loop in 229 ms
By non-tail recursion in 114 ms
By tail recursion in 119 ms
By tail recursion upward in 105 ms
scala> example.Factorial.comparePerformance(20000)
By for loop in 1053 ms
By while loop in 957 ms
By non-tail recursion in 513 ms
By tail recursion in 565 ms
By tail recursion upward in 470 ms
What follows is some output from sbt console after fixing BigInt multiplication in «loops»: the world is totally sane:
scala> example.Factorial.comparePerformance(20000)
By for loop in 498 ms
By while loop in 502 ms
By non-tail recursion in 521 ms
By tail recursion in 611 ms
By tail recursion upward in 503 ms
BigInt overhead and a stupid implementation by me masked the expected behavior.
PS.: In the end I should re-title this post to «A lernt lesson on BigInts»
For loops are not actually quite loops; they're for comprehensions on a range. If you actually want a loop, you need to use while. (Actually, I think the BigInt multiplication here is heavyweight enough so it shouldn't matter. But you'll notice if you're multiplying Ints.)
Also, you have confused yourself by using BigInt. The bigger your BigInt is, the slower your multiplication. So your non-tail loop counts up while your tail recursion loop counds down which means that the latter has more big numbers to multiply.
If you fix these two issues you will find that sanity is restored: loops and tail recursion are the same speed, with both regular recursion and for slower. (Regular recursion may not be slower if the JVM optimization makes it equivalent)
(Also, the stack overflow fix is probably because the JVM starts inlining and may either make the call tail-recursive itself, or unrolls the loop far enough so that you don't overflow any longer.)
Finally, you're getting poor results with for and while because you're multiplying on the right rather than the left with the small number. It turns out that the Java's BigInt multiplies faster with the smaller number on the left.
Scala static methods for factorial(n) (coded with scala 2.12.x, java-8):
object Factorial {
/*
* For large N, it throws a stack overflow
*/
def recursive(n:BigInt): BigInt = {
if(n < 0) {
throw new ArithmeticException
} else if(n <= 1) {
1
} else {
n * recursive(n - 1)
}
}
/*
* A tail recursive method is compiled to avoid stack overflow
*/
#scala.annotation.tailrec
def recursiveTail(n:BigInt, acc:BigInt = 1): BigInt = {
if(n < 0) {
throw new ArithmeticException
} else if(n <= 1) {
acc
} else {
recursiveTail(n - 1, n * acc)
}
}
/*
* A while loop
*/
def loop(n:BigInt): BigInt = {
if(n < 0) {
throw new ArithmeticException
} else if(n <= 1) {
1
} else {
var acc = 1
var idx = 1
while(idx <= n) {
acc = idx * acc
idx += 1
}
acc
}
}
}
Specs:
class FactorialSpecs extends SpecHelper {
private val smallInt = 10
private val largeInt = 10000
describe("Factorial.recursive") {
it("return 1 for 0") {
assert(Factorial.recursive(0) == 1)
}
it("return 1 for 1") {
assert(Factorial.recursive(1) == 1)
}
it("return 2 for 2") {
assert(Factorial.recursive(2) == 2)
}
it("returns a result, for small inputs") {
assert(Factorial.recursive(smallInt) == 3628800)
}
it("throws StackOverflow for large inputs") {
intercept[java.lang.StackOverflowError] {
Factorial.recursive(Int.MaxValue)
}
}
}
describe("Factorial.recursiveTail") {
it("return 1 for 0") {
assert(Factorial.recursiveTail(0) == 1)
}
it("return 1 for 1") {
assert(Factorial.recursiveTail(1) == 1)
}
it("return 2 for 2") {
assert(Factorial.recursiveTail(2) == 2)
}
it("returns a result, for small inputs") {
assert(Factorial.recursiveTail(smallInt) == 3628800)
}
it("returns a result, for large inputs") {
assert(Factorial.recursiveTail(largeInt).isInstanceOf[BigInt])
}
}
describe("Factorial.loop") {
it("return 1 for 0") {
assert(Factorial.loop(0) == 1)
}
it("return 1 for 1") {
assert(Factorial.loop(1) == 1)
}
it("return 2 for 2") {
assert(Factorial.loop(2) == 2)
}
it("returns a result, for small inputs") {
assert(Factorial.loop(smallInt) == 3628800)
}
it("returns a result, for large inputs") {
assert(Factorial.loop(largeInt).isInstanceOf[BigInt])
}
}
}
Benchmarks:
import org.scalameter.api._
class BenchmarkFactorials extends Bench.OfflineReport {
val gen: Gen[Int] = Gen.range("N")(1, 1000, 100) // scalastyle:ignore
performance of "Factorial" in {
measure method "loop" in {
using(gen) in {
n => Factorial.loop(n)
}
}
measure method "recursive" in {
using(gen) in {
n => Factorial.recursive(n)
}
}
measure method "recursiveTail" in {
using(gen) in {
n => Factorial.recursiveTail(n)
}
}
}
}
Benchmark results (loop is much faster):
[info] Test group: Factorial.loop
[info] - Factorial.loop.Test-9 measurements:
[info] - at N -> 1: passed
[info] (mean = 0.01 ms, ci = <0.00 ms, 0.02 ms>, significance = 1.0E-10)
[info] - at N -> 101: passed
[info] (mean = 0.01 ms, ci = <0.01 ms, 0.01 ms>, significance = 1.0E-10)
[info] - at N -> 201: passed
[info] (mean = 0.02 ms, ci = <0.02 ms, 0.02 ms>, significance = 1.0E-10)
[info] - at N -> 301: passed
[info] (mean = 0.03 ms, ci = <0.02 ms, 0.03 ms>, significance = 1.0E-10)
[info] - at N -> 401: passed
[info] (mean = 0.03 ms, ci = <0.03 ms, 0.04 ms>, significance = 1.0E-10)
[info] - at N -> 501: passed
[info] (mean = 0.04 ms, ci = <0.03 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 601: passed
[info] (mean = 0.04 ms, ci = <0.04 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 701: passed
[info] (mean = 0.05 ms, ci = <0.05 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 801: passed
[info] (mean = 0.06 ms, ci = <0.05 ms, 0.06 ms>, significance = 1.0E-10)
[info] - at N -> 901: passed
[info] (mean = 0.06 ms, ci = <0.05 ms, 0.07 ms>, significance = 1.0E-10)
[info] Test group: Factorial.recursive
[info] - Factorial.recursive.Test-10 measurements:
[info] - at N -> 1: passed
[info] (mean = 0.00 ms, ci = <0.00 ms, 0.01 ms>, significance = 1.0E-10)
[info] - at N -> 101: passed
[info] (mean = 0.05 ms, ci = <0.01 ms, 0.09 ms>, significance = 1.0E-10)
[info] - at N -> 201: passed
[info] (mean = 0.03 ms, ci = <0.02 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 301: passed
[info] (mean = 0.07 ms, ci = <0.00 ms, 0.13 ms>, significance = 1.0E-10)
[info] - at N -> 401: passed
[info] (mean = 0.09 ms, ci = <0.01 ms, 0.18 ms>, significance = 1.0E-10)
[info] - at N -> 501: passed
[info] (mean = 0.10 ms, ci = <0.03 ms, 0.17 ms>, significance = 1.0E-10)
[info] - at N -> 601: passed
[info] (mean = 0.11 ms, ci = <0.08 ms, 0.15 ms>, significance = 1.0E-10)
[info] - at N -> 701: passed
[info] (mean = 0.13 ms, ci = <0.11 ms, 0.14 ms>, significance = 1.0E-10)
[info] - at N -> 801: passed
[info] (mean = 0.16 ms, ci = <0.13 ms, 0.19 ms>, significance = 1.0E-10)
[info] - at N -> 901: passed
[info] (mean = 0.21 ms, ci = <0.15 ms, 0.27 ms>, significance = 1.0E-10)
[info] Test group: Factorial.recursiveTail
[info] - Factorial.recursiveTail.Test-11 measurements:
[info] - at N -> 1: passed
[info] (mean = 0.00 ms, ci = <0.00 ms, 0.01 ms>, significance = 1.0E-10)
[info] - at N -> 101: passed
[info] (mean = 0.04 ms, ci = <0.03 ms, 0.05 ms>, significance = 1.0E-10)
[info] - at N -> 201: passed
[info] (mean = 0.12 ms, ci = <0.05 ms, 0.20 ms>, significance = 1.0E-10)
[info] - at N -> 301: passed
[info] (mean = 0.16 ms, ci = <-0.03 ms, 0.34 ms>, significance = 1.0E-10)
[info] - at N -> 401: passed
[info] (mean = 0.12 ms, ci = <0.09 ms, 0.16 ms>, significance = 1.0E-10)
[info] - at N -> 501: passed
[info] (mean = 0.17 ms, ci = <0.15 ms, 0.19 ms>, significance = 1.0E-10)
[info] - at N -> 601: passed
[info] (mean = 0.23 ms, ci = <0.19 ms, 0.26 ms>, significance = 1.0E-10)
[info] - at N -> 701: passed
[info] (mean = 0.25 ms, ci = <0.18 ms, 0.32 ms>, significance = 1.0E-10)
[info] - at N -> 801: passed
[info] (mean = 0.28 ms, ci = <0.21 ms, 0.36 ms>, significance = 1.0E-10)
[info] - at N -> 901: passed
[info] (mean = 0.32 ms, ci = <0.17 ms, 0.46 ms>, significance = 1.0E-10)
I know everyone already answered the question, but I thought that I might add this one optimization: If you convert the pattern matching to simple if-statements, it can speed up the tail recursion.
final object Factorial {
type Out = BigInt
def calculateByRecursion(n: Int): Out = {
require(n>0, "n must be positive")
n match {
case _ if n == 1 => return 1
case _ => return n * calculateByRecursion(n-1)
}
}
def calculateByForLoop(n: Int): Out = {
require(n>0, "n must be positive")
var accumulator: Out = 1
for (i <- 1 to n)
accumulator = i * accumulator
accumulator
}
def calculateByWhileLoop(n: Int): Out = {
require(n>0, "n must be positive")
var acc: Out = 1
var i = 1
while (i <= n) {
acc = i * acc
i += 1
}
acc
}
def calculateByTailRecursion(n: Int): Out = {
require(n>0, "n must be positive")
#annotation.tailrec
def fac(n: Int, acc: Out): Out = if (n==1) acc else fac(n-1, n*acc)
fac(n, 1)
}
def calculateByTailRecursionUpward(n: Int): Out = {
require(n>0, "n must be positive")
#annotation.tailrec
def fac(i: Int, acc: Out): Out = if (i == n) n*acc else fac(i+1, i*acc)
fac(1, 1)
}
def attempt(f: ()=>Unit): Boolean = {
try {
f()
true
} catch {
case _: Throwable =>
println(" <<<<< Failed...")
false
}
}
def comparePerformance(n: Int) {
def showOutput[A](msg: String, data: (Long, A), showOutput:Boolean = true) =
showOutput match {
case true =>
val res = data._2.toString
val pref = res.substring(0,5)
val midd = res.substring((res.length-5)/ 2, (res.length-5)/ 2 + 10)
val suff = res.substring(res.length-5)
printf("%s returned %s in %d ms\n", msg, s"$pref...$midd...$suff" , data._1)
case false =>
printf("%s in %d ms\n", msg, data._1)
}
def measure[A](f:()=>A): (Long, A) = {
val start = System.currentTimeMillis
val o = f()
(System.currentTimeMillis - start, o)
}
attempt(() => showOutput ("By for loop", measure(()=>calculateByForLoop(n))))
attempt(() => showOutput ("By while loop", measure(()=>calculateByWhileLoop(n))))
attempt(() => showOutput ("By non-tail recursion", measure(()=>calculateByRecursion(n))))
attempt(() => showOutput ("By tail recursion", measure(()=>calculateByTailRecursion(n))))
attempt(() => showOutput ("By tail recursion upward", measure(()=>calculateByTailRecursionUpward(n))))
}
}
My results:
scala> Factorial.comparePerformance(20000)
By for loop returned 18192...5708616582...00000 in 179 ms
By while loop returned 18192...5708616582...00000 in 159 ms
By non-tail recursion <<<<< Failed...
By tail recursion returned 18192...5708616582...00000 in 169 ms
By tail recursion upward returned 18192...5708616582...00000 in 174 ms
By for loop returned 18192...5708616582...00000 in 212 ms
By while loop returned 18192...5708616582...00000 in 156 ms
By non-tail recursion returned 18192...5708616582...00000 in 155 ms
By tail recursion returned 18192...5708616582...00000 in 166 ms
By tail recursion upward returned 18192...5708616582...00000 in 137 ms
scala> Factorial.comparePerformance(200000)
By for loop returned 14202...0169293868...00000 in 17467 ms
By while loop returned 14202...0169293868...00000 in 17303 ms
By non-tail recursion <<<<< Failed...
By tail recursion returned 14202...0169293868...00000 in 18477 ms
By tail recursion upward returned 14202...0169293868...00000 in 17188 ms

Resources