I have something like this (simple example):
using BenchmarkTools
function assign()
    e = zeros(100, 90000)
    e2 = ones(100) * 0.16
    e[:, 100:end] .= e2[:]
end
@benchmark assign()
and I need to do this for thousands of time steps. This gives
BenchmarkTools.Trial:
  memory estimate:  68.67 MiB
  allocs estimate:  6
  --------------
  minimum time:     16.080 ms (0.00% GC)
  median time:      27.811 ms (0.00% GC)
  mean time:        31.822 ms (12.31% GC)
  maximum time:     43.439 ms (27.66% GC)
  --------------
  samples:          158
  evals/sample:     1
Is there a faster way of doing this?
First of all, I will assume that you meant:
function assign1()
    e = zeros(100, 90000)
    e2 = ones(100) * 0.16
    e[:, 100:end] .= e2[:]
    return e # <- important!
end
Since otherwise you would not return the first 99 columns of e(!):
julia> size(assign())
(100, 89901)
Secondly, don't do this:
e[:, 100:end] .= e2[:]
e2[:] makes a copy of e2 and assigns that, but why? Just assign e2 directly:
e[:, 100:end] .= e2
OK, but let's try a few different versions. Notice that there is no need to make e2 a vector; just assign a scalar:
function assign2()
    e = zeros(100, 90000)
    e[:, 100:end] .= 0.16 # just broadcast a scalar!
    return e
end

function assign3()
    e = fill(0.16, 100, 90000) # use fill instead of writing all those zeros that you will throw away
    e[:, 1:99] .= 0
    return e
end

function assign4()
    # only write exactly the values you need!
    e = Matrix{Float64}(undef, 100, 90000)
    e[:, 1:99] .= 0
    e[:, 100:end] .= 0.16
    return e
end
Time to benchmark:
julia> @btime assign1();
  14.550 ms (5 allocations: 68.67 MiB)

julia> @btime assign2();
  14.481 ms (2 allocations: 68.66 MiB)

julia> @btime assign3();
  9.636 ms (2 allocations: 68.66 MiB)

julia> @btime assign4();
  10.062 ms (2 allocations: 68.66 MiB)
Versions 1 and 2 are equally fast; version 2 makes 2 allocations instead of 5, but, of course, the one big allocation dominates.
Versions 3 and 4 are faster, though not dramatically so: they avoid duplicate work, such as writing values into the matrix twice. Version 3 is the fastest here, but not by much. This changes if the assignment is more balanced between the two value ranges, in which case version 4 is faster:
function assign3_()
    e = fill(0.16, 100, 90000)
    e[:, 1:44999] .= 0
    return e
end

function assign4_()
    e = Matrix{Float64}(undef, 100, 90000)
    e[:, 1:44999] .= 0
    e[:, 45000:end] .= 0.16
    return e
end
julia> @btime assign3_();
  11.576 ms (2 allocations: 68.66 MiB)

julia> @btime assign4_();
  8.658 ms (2 allocations: 68.66 MiB)
The lesson is to avoid doing unnecessary work.
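Since this runs for thousands of time steps, it is usually even better to allocate the matrix once, outside the loop, and refill it in place on every step, so the one big allocation happens only once. A minimal sketch of that pattern (the step! name and the loop bound are mine, just for illustration):
function step!(e)
    # overwrite in place: write each value exactly once, allocate nothing
    e[:, 1:99] .= 0.0
    e[:, 100:end] .= 0.16
    return e
end

e = Matrix{Float64}(undef, 100, 90000)  # one big allocation, up front
for t in 1:1000                         # thousands of time steps
    step!(e)
    # ... use e for this time step ...
end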
In Julia, I would like to randomly generate an array of arbitrary size, where all the elements of the array are complex numbers with absolute value one. Is there perhaps any way to do this within Julia?
I've got four options so far:
f1(n) = exp.((2*im*π) .* rand(n))
f2(n) = map(x -> (z = x[1] + im*x[2]; z / abs(z)),
            eachcol(randn(2, n)))
f3(n) = [im*x[1] + x[2] for x in sincos.(2π*rand(n))]
f4(n) = cispi.(2 .* rand(n))
We have:
julia> using BenchmarkTools
julia> begin
           @btime f1(1_000);
           @btime f2(1_000);
           @btime f3(1_000);
           @btime f4(1_000);
       end;
  29.390 μs (2 allocations: 23.69 KiB)
  15.559 μs (2 allocations: 31.50 KiB)
  25.733 μs (4 allocations: 47.38 KiB)
  27.662 μs (2 allocations: 23.69 KiB)
Not a crucial difference.
One way is:
randcomplex() = (c = Complex(rand(2)...); c / abs(c))
randcomplex(numwanted) = [randcomplex() for _ in 1:numwanted]
or
randcomplex(dims...) = (a = zeros(ComplexF64, dims...); for i in eachindex(a) a[i] = randcomplex() end; a)
(Using the concrete element type ComplexF64 rather than the abstract Complex keeps the array efficiently typed.)
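Note that rand returns values in [0, 1), so randcomplex above (and f5/f5t below, which also normalize points drawn with rand) samples only the first quadrant of the circle, and the angle is not uniform. If you want a uniform direction on the whole circle, a small variant (my own sketch) normalizes Gaussian coordinates instead, like f2 above:
randcomplex_full() = (c = complex(randn(), randn()); c / abs(c)) # uniform direction on the whole circle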
If you are looking for something faster, here are two options. They return a perhaps slightly unfamiliar type, but it is equivalent to a regular Vector:
function f5(n)
    r = rand(2, n)                                # two coordinates per column
    for i in 1:n
        a = sqrt(r[1, i]^2 + r[2, i]^2)
        r[1, i] /= a                              # normalize the column to unit length
        r[2, i] /= a
    end
    return reinterpret(reshape, ComplexF64, r)    # view each column as one ComplexF64
end
using LoopVectorization: @turbo

function f5t(n)
    r = rand(2, n)
    @turbo for i in 1:n
        a = sqrt(r[1, i]^2 + r[2, i]^2)
        r[1, i] /= a
        r[2, i] /= a
    end
    return reinterpret(reshape, ComplexF64, r)
end
julia> @btime f5(1000);
  4.186 μs (1 allocation: 15.75 KiB)

julia> @btime f5t(1000);
  2.900 μs (1 allocation: 15.75 KiB)
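As a quick sanity check (hypothetical REPL session) that the reinterpreted result behaves like an ordinary vector of unit-modulus complex numbers:
julia> z = f5(3);

julia> eltype(z), length(z)
(ComplexF64, 3)

julia> all(x -> abs(x) ≈ 1, z)
true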
The matrix Y is defined as
Y = cumsum(cumsum(X,dims=1), dims=2)
For example,
julia> X = [1 4 2 3; 2 4 5 2; 4 3 4 1; 2 5 4 2];
julia> Y = cumsum(cumsum(X,dims=1), dims=2)
4×4 Matrix{Int64}:
 1   5   7  10
 3  11  18  23
 7  18  29  35
 9  25  40  48
I want to reproduce the matrix X from Y. It seems the function diff is helpful. However, as you can see below, it cannot reproduce the first row and first column of X:
julia> diff(diff(Y, dims=1), dims=2)
3×3 Matrix{Int64}:
 4  5  2
 3  4  1
 5  4  2
So I concatenate zeros; then it works.
julia> Y00 = vcat(zeros(Int, 5)', hcat(zeros(Int, 4), Y))
5×5 Matrix{Int64}:
 0  0   0   0   0
 0  1   5   7  10
 0  3  11  18  23
 0  7  18  29  35
 0  9  25  40  48
julia> diff(diff(Y00, dims=1), dims=2)
4×4 Matrix{Int64}:
 1   5   7  10
 3  11  18  23
 7  18  29  35
 9  25  40  48
But I think concatenating takes time and memory.
Is there any better idea to reproduce X from Y?
Context
I want to extend the above matrices X and Y to arrays of any dimension. For example, I want to reconstruct a three-dimensional array X from a given three-dimensional array
Y = cumsum( cumsum( cumsum(X, dims=1), dims=2), dims=3)
When both speed and succinctness are required, it's hard to beat powerful Julia packages like Tullio.jl. Here is a one-liner that's about 4X faster than the fastest solution by @DanGetz.
using Tullio
cumdiff(Y) = @tullio X[i,j] = Y[i,j] - Y[i,j-1] - Y[i-1,j] + Y[i-1,j-1]
Benchmarking with a 100-by-100 matrix gives:
X = rand(0:100, 100, 100)
Y = cumsum(cumsum(X, dims=1), dims=2)
@btime cumdiff($Y)
@btime decumsum3($Y)

  4.957 μs (17 allocations: 464 bytes)
  21.300 μs (2 allocations: 78.17 KiB)
Fix: The code above was using the predefined X instead of creating a new one. This is fixed below, and the speedup is more like 3.5X and not 4X.
function cumdiff(Y)
    X = similar(Y)
    X[1] = Y[1]
    for i = 2:size(Y,1) X[i,1] = Y[i,1] - Y[i-1,1] end
    for j = 2:size(Y,2) X[1,j] = Y[1,j] - Y[1,j-1] end
    @tullio X[i,j] = Y[i,j] - Y[i,j-1] - Y[i-1,j] + Y[i-1,j-1]
end

@btime cumdiff($Y)
@btime decumsum3($Y)

  6.000 μs (4 allocations: 78.23 KiB)
  21.300 μs (2 allocations: 78.17 KiB)
See EDIT section below.
Some options so far:
decumsum1(X) = begin
    Z = copy(X)
    Z[2:end,:] .-= Z[1:end-1,:]
    Z[:,2:end] .-= Z[:,1:end-1]
    return Z
end

decumsum2(X) = begin # this is from the question (zero row/column sizes fixed for non-square X)
    r, c = size(X)
    Z = vcat(zeros(eltype(X), c+1)',
             hcat(zeros(eltype(X), r), X))
    return diff(diff(Z, dims=1), dims=2)
end

decumsum3(Y) = [Y[I] - (I[2]==1 ? 0 : Y[I[1], I[2]-1]) -
                (I[1]==1 ? 0 : Y[I[1]-1, I[2]]) +
                ((I[1]==1 || I[2]==1) ? 0 : Y[I[1]-1, I[2]-1])
                for I in CartesianIndices(Y)]

function decumsum5(Y)
    R = similar(Y)
    h, w = size(Y)
    R[1,1] = Y[1,1]
    @inbounds for i = 2:h R[i,1] = Y[i,1] - Y[i-1,1] end
    @inbounds for j = 2:w R[1,j] = Y[1,j] - Y[1,j-1] end
    @inbounds for i = 2:h, j = 2:w R[i,j] = Y[i,j] - Y[i-1,j] - Y[i,j-1] + Y[i-1,j-1] end
    return R
end
Giving the following benchmarks:
julia> using BenchmarkTools
julia> decumsum1(Y) == decumsum2(Y) == decumsum3(Y) == X
true
julia> @btime decumsum1($Y);
  352.571 ns (5 allocations: 832 bytes)

julia> @btime decumsum2($Y);
  475.438 ns (9 allocations: 1.14 KiB)

julia> @btime decumsum3($Y);
  96.875 ns (1 allocation: 192 bytes)

julia> @btime decumsum5($Y);
  60.805 ns (1 allocation: 192 bytes)
EDIT: Perhaps the prettiest solution is:
decumsum(Y; dims) = [Y[I] - (
        I[dims] == 1 ? 0 : Y[(ifelse(k == dims, I[k]-1, I[k])
                              for k in 1:ndims(Y))...]
    ) for I in CartesianIndices(Y)]
and with it, the cumsum can be walked back:
julia> decumsum(decumsum(Y, dims=1), dims=2)
4×4 Matrix{Int64}:
 1  4  2  3
 2  4  5  2
 4  3  4  1
 2  5  4  2
julia> decumsum(decumsum(Y, dims=1), dims=2) == X
true
julia> @btime decumsum(decumsum($Y, dims=1), dims=2);
  165.656 ns (2 allocations: 384 bytes)
with nice performance and also generalized to any Array dimension.
Update: another version, decumsum5, has been added above. It is faster still.
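For instance, the three-dimensional case from the question works by applying decumsum once per dimension (a hypothetical REPL session):
julia> X3 = rand(0:9, 3, 3, 3);

julia> Y3 = cumsum(cumsum(cumsum(X3, dims=1), dims=2), dims=3);

julia> decumsum(decumsum(decumsum(Y3, dims=1), dims=2), dims=3) == X3
true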
I have a general question. I've got a Julia program that needs to use a random number each time it iterates through a for loop. I'm wondering: are there any performance benefits to be gained by making batches of random numbers before the loop and storing them in an array, then using these pre-made random numbers instead of generating them on the fly? And, if so, is there an optimum batch size?
As Peter O. commented, it depends. But let me give you an example where batching is desirable:
julia> using Random, BenchmarkTools
julia> function f1()
           x = Vector{Float64}(undef, 10^6)
           y = zeros(10^6)
           for i in 1:100
               rand!(x)
               y .+= x
           end
           return y
       end
f1 (generic function with 1 method)

julia> function f2()
           y = zeros(10^6)
           @inbounds for i in 1:100
               @simd for j in 1:10^6
                   y[j] += rand()
               end
           end
           return y
       end
f2 (generic function with 1 method)

julia> function f3()
           y = zeros(10^6)
           @inbounds for i in 1:100
               for j in 1:10^6
                   y[j] += rand()
               end
           end
           return y
       end
f3 (generic function with 1 method)

julia> function f4()
           x = Vector{Float64}(undef, 10^6)
           y = zeros(10^6)
           @inbounds for i in 1:100
               rand!(x)
               @simd for j in 1:10^6
                   y[j] += x[j]
               end
           end
           return y
       end
f4 (generic function with 1 method)

julia> function f5()
           x = Vector{Float64}(undef, 10^6)
           y = zeros(10^6)
           @inbounds for i in 1:100
               rand!(x)
               for j in 1:10^6
                   y[j] += x[j]
               end
           end
           return y
       end
f5 (generic function with 1 method)
julia> @btime f1();
  171.816 ms (4 allocations: 15.26 MiB)

julia> @btime f2();
  370.950 ms (2 allocations: 7.63 MiB)

julia> @btime f3();
  412.871 ms (2 allocations: 7.63 MiB)

julia> @btime f4();
  172.355 ms (4 allocations: 15.26 MiB)

julia> @btime f5();
  174.676 ms (4 allocations: 15.26 MiB)
As you can see, f1 (and its variants f4 and f5, which use an explicit loop) is much faster than the versions that do not use a cache for the generated random numbers (f2 and f3). I have shown variants with and without @simd for comparison.
EDIT
The comment by rafak is very good. Here are the benchmarks. As you can see there is still some difference, but it is much smaller (as most of the cost is generating the random numbers, not the addition).
julia> function g1(rnd)
           x = Vector{Float64}(undef, 10^6)
           y = zeros(10^6)
           for i in 1:100
               rand!(rnd, x)
               y .+= x
           end
           return y
       end
g1 (generic function with 1 method)

julia> function g2(rnd)
           y = zeros(10^6)
           @inbounds for i in 1:100
               @simd for j in 1:10^6
                   y[j] += rand(rnd)
               end
           end
           return y
       end
g2 (generic function with 1 method)

julia> function g3(rnd)
           y = zeros(10^6)
           @inbounds for i in 1:100
               for j in 1:10^6
                   y[j] += rand(rnd)
               end
           end
           return y
       end
g3 (generic function with 1 method)
julia> using Random
julia> rnd = MersenneTwister();
julia> @btime g1($rnd);
  168.874 ms (4 allocations: 15.26 MiB)

julia> @btime g2($rnd);
  193.398 ms (2 allocations: 7.63 MiB)

julia> @btime g3($rnd);
  192.320 ms (2 allocations: 7.63 MiB)
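On the question of an optimum batch size: the functions above cache all 10^6 numbers at once, but you can trade memory for essentially the same speedup with a much smaller reusable buffer. A sketch of that idea (the 10^4 batch length and the helper name are mine, not from the answer; tune the length by benchmarking on your machine):
using Random

function sum_batched!(y, buf)
    n, b = length(y), length(buf)
    for lo in 1:b:n                     # walk y in chunks of length(buf)
        hi = min(lo + b - 1, n)
        rand!(buf)                      # refill the small buffer in place
        @views y[lo:hi] .+= buf[1:hi-lo+1]
    end
    return y
end

y = zeros(10^6)
buf = Vector{Float64}(undef, 10^4)      # assumed batch size; benchmark to tune
sum_batched!(y, buf)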
What my code does
The goal was to build a function that checks whether all brackets open and close correctly in a given string, in Julia. So,
"{abc()([[def]])()}"
should return true, while something like
"{(bracket order mixed up here!})[and this bracket doesn't close!"
should return false.
Question
I have two versions of the function. Why is version I faster by about 10%?
Version I
function matching_brackets_old(s::AbstractString)
    close_open_map = Dict('}' => '{', ')' => '(', ']' => '[')
    order_arr = []
    for char in s
        if char in values(close_open_map)
            push!(order_arr, char)
        elseif (char in keys(close_open_map)) &&
               (isempty(order_arr) || (close_open_map[char] != pop!(order_arr)))
            return false
        end
    end
    return isempty(order_arr)
end
Version II
Here I replace the for loop with a do block:
function matching_brackets(s::AbstractString)
    close_open_map = Dict('}' => '{', ')' => '(', ']' => '[')
    order_arr = []
    all_correct = all(s) do char
        if char in values(close_open_map)
            push!(order_arr, char)
        elseif (char in keys(close_open_map)) &&
               (isempty(order_arr) || (close_open_map[char] != pop!(order_arr)))
            return false
        end
        return true
    end
    return all_correct && isempty(order_arr)
end
Timings
Using BenchmarkTools' @benchmark on the strings "{()()[()]()}" and "{()()[())]()}", I get a slowdown of about 10% for both strings when comparing minimum execution times.
Additional Info
Version Info:
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin18.6.0)
  CPU: Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, haswell)
Timing Code:
using BenchmarkTools
benchmark_strings = ["{()()[()]()}", "{()()[())]()}"]
for s in benchmark_strings
    b_old = @benchmark matching_brackets_old("$s") samples=100000 seconds=30
    b_new = @benchmark matching_brackets("$s") samples=100000 seconds=30
    println("For String=", s)
    println(b_old)
    println(b_new)
    println(judge(minimum(b_new), minimum(b_old)))
    println("Result: ", matching_brackets(s))
end
With Result:
For String={()()[()]()}
Trial(8.177 μs)
Trial(9.197 μs)
TrialJudgement(+12.48% => regression)
Result: true
For String={()()[())]()}
Trial(8.197 μs)
Trial(9.202 μs)
TrialJudgement(+12.27% => regression)
Result: false
Edit
I mixed up the order on the TrialJudgement, so Version I is faster, as François Févotte suggests. My question remains: why?
Now that the mistake with judge is resolved, the answer is probably the usual caveat: function calls, here resulting from the closure passed to all, are well optimized, but not free.
To get a real improvement I suggest, besides making the stack type-stable (which isn't that big a deal here), getting rid of the iterations you implicitly do by calling in on values and keys. It suffices to iterate over the bracket pairs once, without a dictionary:
const MATCHING_PAIRS = ('{' => '}', '(' => ')', '[' => ']')

function matching_brackets(s::AbstractString)
    stack = Vector{eltype(s)}()
    for c in s
        for (open, close) in MATCHING_PAIRS
            if c == open
                push!(stack, c)
            elseif c == close
                if isempty(stack) || (pop!(stack) != open)
                    return false
                end
            end
        end
    end
    return isempty(stack)
end
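A quick check against the examples from the question (hypothetical REPL session):
julia> matching_brackets("{abc()([[def]])()}")
true

julia> matching_brackets("{(bracket order mixed up here!})[and this bracket doesn't close!")
false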
Even a bit more time can be squeezed out by unrolling the inner loop over the tuple:
function matching_brackets_unrolled(s::AbstractString)
    stack = Vector{eltype(s)}()
    for c in s
        if (c == '(') || (c == '[') || (c == '{')
            push!(stack, c)
        elseif (c == ')')
            if isempty(stack) || (pop!(stack) != '(')
                return false
            end
        elseif (c == ']')
            if isempty(stack) || (pop!(stack) != '[')
                return false
            end
        elseif (c == '}')
            if isempty(stack) || (pop!(stack) != '{')
                return false
            end
        end
    end
    return isempty(stack)
end
This is somewhat ugly and certainly not nicely extensible, though. My benchmarks (matching_brackets_new is your second version, matching_brackets my first one):
julia> versioninfo()
Julia Version 1.3.1
Commit 2d5741174c (2019-12-30 21:36 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7 CPU 960 @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, nehalem)
# NOT MATCHING
julia> @benchmark matching_brackets_new("{()()[())]()}")
BenchmarkTools.Trial:
  memory estimate:  784 bytes
  allocs estimate:  16
  --------------
  minimum time:     674.844 ns (0.00% GC)
  median time:      736.200 ns (0.00% GC)
  mean time:        800.935 ns (6.54% GC)
  maximum time:     23.831 μs (96.16% GC)
  --------------
  samples:          10000
  evals/sample:     160

julia> @benchmark matching_brackets_old("{()()[())]()}")
BenchmarkTools.Trial:
  memory estimate:  752 bytes
  allocs estimate:  15
  --------------
  minimum time:     630.743 ns (0.00% GC)
  median time:      681.725 ns (0.00% GC)
  mean time:        753.937 ns (6.41% GC)
  maximum time:     23.056 μs (94.19% GC)
  --------------
  samples:          10000
  evals/sample:     171

julia> @benchmark matching_brackets("{()()[())]()}")
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  2
  --------------
  minimum time:     164.883 ns (0.00% GC)
  median time:      172.900 ns (0.00% GC)
  mean time:        186.523 ns (4.33% GC)
  maximum time:     5.428 μs (96.54% GC)
  --------------
  samples:          10000
  evals/sample:     759

julia> @benchmark matching_brackets_unrolled("{()()[())]()}")
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  2
  --------------
  minimum time:     134.459 ns (0.00% GC)
  median time:      140.292 ns (0.00% GC)
  mean time:        150.067 ns (5.84% GC)
  maximum time:     5.095 μs (96.56% GC)
  --------------
  samples:          10000
  evals/sample:     878

# MATCHING
julia> @benchmark matching_brackets_old("{()()[()]()}")
BenchmarkTools.Trial:
  memory estimate:  800 bytes
  allocs estimate:  18
  --------------
  minimum time:     786.358 ns (0.00% GC)
  median time:      833.873 ns (0.00% GC)
  mean time:        904.437 ns (5.43% GC)
  maximum time:     29.355 μs (96.88% GC)
  --------------
  samples:          10000
  evals/sample:     106

julia> @benchmark matching_brackets_new("{()()[()]()}")
BenchmarkTools.Trial:
  memory estimate:  832 bytes
  allocs estimate:  19
  --------------
  minimum time:     823.597 ns (0.00% GC)
  median time:      892.506 ns (0.00% GC)
  mean time:        981.381 ns (5.98% GC)
  maximum time:     47.308 μs (97.84% GC)
  --------------
  samples:          10000
  evals/sample:     77

julia> @benchmark matching_brackets("{()()[()]()}")
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  2
  --------------
  minimum time:     206.062 ns (0.00% GC)
  median time:      214.481 ns (0.00% GC)
  mean time:        227.385 ns (3.38% GC)
  maximum time:     6.890 μs (96.22% GC)
  --------------
  samples:          10000
  evals/sample:     535

julia> @benchmark matching_brackets_unrolled("{()()[()]()}")
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  2
  --------------
  minimum time:     160.186 ns (0.00% GC)
  median time:      164.752 ns (0.00% GC)
  mean time:        180.794 ns (4.95% GC)
  maximum time:     5.751 μs (97.03% GC)
  --------------
  samples:          10000
  evals/sample:     800
Update: if you insert breaks in the first version, to really avoid unnecessary looping, the timings are almost indistinguishable, with nice code:
function matching_brackets(s::AbstractString)
    stack = Vector{eltype(s)}()
    for c in s
        for (open, close) in MATCHING_PAIRS
            if c == open
                push!(stack, c)
                break
            elseif c == close
                if isempty(stack) || (pop!(stack) != open)
                    return false
                end
                break
            end
        end
    end
    return isempty(stack)
end
with
julia> @benchmark matching_brackets_unrolled("{()()[())]()}")
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  2
  --------------
  minimum time:     137.574 ns (0.00% GC)
  median time:      144.978 ns (0.00% GC)
  mean time:        165.365 ns (10.44% GC)
  maximum time:     9.344 μs (98.02% GC)
  --------------
  samples:          10000
  evals/sample:     867

julia> @benchmark matching_brackets("{()()[())]()}") # with breaks
BenchmarkTools.Trial:
  memory estimate:  112 bytes
  allocs estimate:  2
  --------------
  minimum time:     148.255 ns (0.00% GC)
  median time:      155.231 ns (0.00% GC)
  mean time:        175.245 ns (9.62% GC)
  maximum time:     9.602 μs (98.31% GC)
  --------------
  samples:          10000
  evals/sample:     839
I don't observe the same on my machine: in my tests, version I is faster for both strings:
julia> versioninfo()
Julia Version 1.3.0
Commit 46ce4d7933 (2019-11-26 06:09 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_PROJECT = @.
julia> @btime matching_brackets_old("{()()[()]()}")
  716.443 ns (18 allocations: 800 bytes)
true

julia> @btime matching_brackets("{()()[()]()}")
  761.434 ns (19 allocations: 832 bytes)
true

julia> @btime matching_brackets_old("{()()[())]()}")
  574.847 ns (15 allocations: 752 bytes)
false

julia> @btime matching_brackets("{()()[())]()}")
  612.793 ns (16 allocations: 784 bytes)
false
I would think (but this is a wild guess) that the difference between for loops and higher-order functions gets less and less significant when the string size increases.
However, I would encourage you to look more closely at the order_arr variable: as currently written, it is of type Vector{Any}, which, like any container of abstractly typed values, hurts performance. The following version performs better by concretely typing the elements of order_arr:
function matching_brackets_new(s::AbstractString)
    close_open_map = Dict('}' => '{', ')' => '(', ']' => '[')
    # make sure the compiler knows the element type of order_arr
    order_arr = eltype(s)[] # or order_arr = Char[]
    for char in s
        if char in values(close_open_map)
            push!(order_arr, char)
        elseif (char in keys(close_open_map)) &&
               (isempty(order_arr) || (close_open_map[char] != pop!(order_arr)))
            return false
        end
    end
    return isempty(order_arr)
end
yielding:
julia> @btime matching_brackets_new("{()()[()]()}")
  570.641 ns (18 allocations: 784 bytes)
true

julia> @btime matching_brackets_new("{()()[())]()}")
  447.758 ns (15 allocations: 736 bytes)
false
I am currently testing Julia (I've worked with Matlab).
In Matlab, computing N^3 is slower than N*N*N. This doesn't happen with N^2 and N*N. Matlab uses a different algorithm to calculate higher-order exponents because it prefers accuracy over speed.
I think Julia does the same thing.
I wanted to ask if there is a way to force Julia to calculate the exponent of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago I did a few tests of this in Matlab. I made a translation of that code to Julia.
Link to the code:
http://pastebin.com/bbeukhTc
(I can't upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x the fastest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x the fastest)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.4.6:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x the fastest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5x the fastest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests Julia is faster than Matlab, but I am using a relatively old version. I can't test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
    box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
    box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
    box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm.
Checking LLVM's source code:
define double @powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
  %result = call double @llvm.powi.f64(double %F, i32 %power)
  ret double %result
}
Now, the __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
    const int recip = b < 0;
    double r = 1;
    while (1)
    {
        if (b & 1)
            r *= a;
        b /= 2;
        if (b == 0)
            break;
        a *= a;
    }
    return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence of multiplications (square-and-multiply, as traced above). That's why N^3 is slower than N^2: it takes more multiplications.
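For illustration, here is a direct Julia translation of that loop (my own sketch, not code from Base):
# square-and-multiply, mirroring __powidf2 above
function powi(a::Float64, b::Integer)
    recip = b < 0
    r = 1.0
    while true
        if b & 1 == 1   # lowest bit of the exponent set: fold the current square into r
            r *= a
        end
        b ÷= 2          # move to the next bit (truncating division, like C's b /= 2)
        b == 0 && break
        a *= a          # square once per bit
    end
    return recip ? 1/r : r
end

powi(2.0, 7)  # 128.0, as in Example 1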
jl_powi_llvm (called in fastmath.jl; the "jl_" prefix is added by macro expansion), on the other hand, casts the exponent to floating point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
    jl_value_t *ty = jl_typeof(a);
    if (!jl_is_bitstype(ty))
        jl_error("powi_llvm: a is not a bitstype");
    if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
        jl_error("powi_llvm: b is not a 32-bit bitstype");
    jl_value_t *newv = newstruct((jl_datatype_t*)ty);
    void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
    int sz = jl_datatype_size(ty);
    switch (sz) {
    /* choose the right size c-type operation */
    case 4:
        *(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
        break;
    case 8:
        *(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
        break;
    default:
        jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
    }
    return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: yes, there is a way to force the use of multiplication, at the cost of accuracy. It's the @fastmath macro:
julia> @benchmark 1.1 ^ 3
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     999
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  16.00 bytes
  allocs estimate:  1
  minimum time:     13.00 ns (0.00% GC)
  median time:      14.00 ns (0.00% GC)
  mean time:        15.74 ns (6.14% GC)
  maximum time:     1.85 μs (98.16% GC)

julia> @benchmark @fastmath 1.1 ^ 3
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     1000
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  0.00 bytes
  allocs estimate:  0
  minimum time:     2.00 ns (0.00% GC)
  median time:      3.00 ns (0.00% GC)
  mean time:        2.59 ns (0.00% GC)
  maximum time:     20.00 ns (0.00% GC)
Note that with @fastmath, performance is much better.
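To use this in your own code, you can wrap just the power computation in the macro (a hypothetical example; the name cube is mine):
cube(x) = @fastmath x^3  # opts into the fast, less accurate power for this call site

cube(1.1)  # ≈ 1.331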